ABSTRACT

A well-established and fundamental insight in database the- 
ory is that negation (also known as complementation) tends 
to make queries difficult to process and difficult to reason 
about. Many basic problems are decidable and admit prac- 
tical algorithms in the case of unions of conjunctive queries, 
but become difficult or even undecidable when queries are 
allowed to contain negation. Inspired by recent results in fi- 
nite model theory, we consider a restricted form of negation, 
guarded negation. We introduce a fragment of SQL, called 
GN-SQL, as well as a fragment of Datalog with stratified 
negation, called GN-Datalog, that allow only guarded
negation, and we show that these query languages are
computationally well behaved, in terms of testing query
containment, query evaluation, open-world query answering,
and boundedness. GN-SQL and GN-Datalog subsume a number
of well-known query languages and constraint languages,
such as unions of conjunctive queries, monadic Datalog, and 
frontier-guarded tgds. In addition, an analysis of standard 
benchmark workloads shows that most usage of negation in 
SQL in practice is guarded negation. 

1. INTRODUCTION 

A well-established and fundamental insight of database 
theory is that negation (also called complementation or dif- 
ference) tends to make queries difficult to reason about. Re- 
call that the unions of conjunctive queries are the first-order 
queries that can be expressed without using negation. Many 
basic problems are decidable and admit practical algorithms 
in the case of unions of conjunctive queries, but are unde- 
cidable in the case of arbitrary first-order queries. Examples 
include query containment and open world query answering. 
We argue that most queries in practice use only a
restricted form of negation, which is called guarded
negation and was first considered in [7] (in the study of
decidable fragments of first-order logic). By guarded
negation we mean that queries may involve negative
conditions only if these conditions, intuitively, pertain to
a single record in the database. For instance, if a
database schema contains relations Author(AuthID,Name)
and Book(AuthID,Title,Year,Publisher), the query that
asks for authors that did not publish any book with Elsevier
is allowed, since "not publishing a book with Elsevier" is a
property of an author. The query that asks for pairs of
authors and book titles where the author did not publish the
book, on the other hand, is not allowed, since it involves a
negative condition (in this case, an inequality) pertaining
to two values that do not necessarily co-occur in a record in
the database. The requirement of guarded negation can be
formally stated most easily in terms of the Relational
Algebra: we allow the use of the difference operator
$E_1 - E_2$ provided that $E_1$ is a projection of a
relation from the database.

Detailed proofs of the results in this paper can be found in
the appendix. We would like to thank Alkis Polyzotis for
helpful comments.

Based on an analysis of standard SQL benchmark work- 
loads, we argue that guarded negation covers most uses of 
negation in SQL in practice. Furthermore, building on
recent results in logic and finite model theory [7], we show
that queries with guarded negation are computationally very 
well behaved. For instance, query containment and open 
world query answering are decidable for first-order queries 
with guarded negation, and boundedness is decidable for the 
guarded-negation fragment of Datalog with stratified nega- 
tion. We also determine the complexity of query evaluation 
for queries with guarded negation, which (under reasonable 
complexity theoretic assumptions) is easier than the same 
problem for queries with unguarded negation. 

Our results show that guarded negation is a fruitful con- 
cept for databases, in the sense that it enables solving central 
decision problems in database theory more efficiently. We 
also believe that guarded negation is a fruitful concept from 
a more practical point of view, allowing for efficient query 
plans and query optimization strategies. This is something 
we are exploring in a separate line of investigation. 

Outline and Main Results. In Section 2, we review the
definition of GNFO, guarded-negation first-order logic, and
GNFP, guarded-negation fixed-point logic, as well as the
main known decidability and complexity results for these 
logics [7]. We also provide an equivalent characterization of 
GNFO in terms of the Relational Algebra. 

In Section 3, we investigate what it means for an SQL
query to be negation-guarded. Specifically, we identify syn- 
tactic restrictions on the use of negation in SQL queries, 
and we show that the first-order queries satisfying these
restrictions can be translated to GNFO, and, in fact, are
expressively complete for GNFO, in the sense of Codd's
completeness theorem. Furthermore, by means of an analysis
of standard SQL benchmark workloads, we show that most 
SQL queries in practice satisfy the syntactic restrictions. 

In Section 4, similarly, we introduce a syntactic fragment
of Datalog with stratified negation, called GN-Datalog,
which admits a translation into GNFP. 

In Section 5, we show that GN-SQL and GN-Datalog subsume
a number of important existing query languages and
constraint languages. In particular, GN-Datalog subsumes 
both monadic Datalog (which it extends by allowing IDBs 
of arbitrary arity, and negation, subject to guardedness con- 
ditions) and unions of conjunctive queries. 

In Section 6, we show that query containment is
2ExpTime-complete for GN-SQL queries as well as for GN- 
Datalog queries (note that the decidability of these problems 
follows via translations into GNFO and GNFP). 

In Section 7, we determine the complexity of query
evaluation and open-world query answering for GN-SQL and for
GN-Datalog. While the data complexity of query evaluation
is in PTime, both for GN-SQL and for GN-Datalog, in
terms of combined complexity, the problem is complete for
the complexity class $P^{NP[\log^2]}$ (for GN-SQL) and
$P^{NP}$ (for GN-Datalog). The data complexity of open-world
query answering for GN-SQL with respect to incomplete
databases
is coNP-complete. The problem can be solved in PTime for 
a considerable fragment of GN-SQL. 

In Section 8, we prove decidability of the boundedness
problem for GN-Datalog. Boundedness is a classical deci- 
sion problem in the study of query optimization for recur- 
sive queries. It is known to be undecidable for Datalog, but 
decidable for monadic Datalog. Our result can be viewed as 
a powerful generalization of the decidability of boundedness 
for monadic Datalog queries [17].

We conclude in Section 9 by discussing possibilities for
further extending GN-SQL and GN-Datalog. 

2. PRELIMINARIES 

In this section, we review definitions and results
concerning the guarded-negation logics GNFO and GNFP from [7].
These results will be put to extensive use in the rest of this 
paper. We assume familiarity with the basic syntax and 
semantics of first-order logic. 

For clarity, we will maintain a distinction between in- 
stances and structures: a structure has an associated do- 
main, which may be a superset of its active domain, and 
which may depend on the structure in question. Further- 
more, structures may interpret not only relation symbols 
but also constant symbols (which denote, not necessarily 
distinct, domain elements). Thus, instances may be viewed 
as a special case of structures, where the domain is the active 
domain and there are no constant symbols. Unless explicitly 
stated otherwise (by means of the adjective "unrestricted"), 
we always assume structures and instances to be finite. 

GNFO. Guarded Negation First-Order Logic (GNFO) is 
the fragment of first-order logic consisting of all formulas 
built up from atomic formulas (including equalities) us- 
ing conjunction, disjunction, existential quantification, and 
guarded negation, that is, negation in the specific form
$\alpha \wedge \neg\phi$, where $\alpha$ is an atomic formula (possibly an
equality statement) and all free variables of $\phi$ occur in
$\alpha$. Note that, since the guard $\alpha$ is allowed to be an
equality statement, we are essentially able to negate any
formula with at most one free variable (by writing
$x = x \wedge \neg\phi(x)$). Formally, the formulas of GNFO are
generated by the recursive definition

$\phi ::= R(t_1,\ldots,t_n) \mid t_1 = t_2 \mid \phi_1 \wedge \phi_2 \mid \phi_1 \vee \phi_2 \mid \exists x\,\phi \mid \alpha \wedge \neg\phi$

where each $t_i$ is either a variable or a constant symbol, and,
in the last clause, $\alpha$ is an atomic formula containing all free
variables of $\phi$. Note that function symbols (of arity greater
than zero) are not considered.

In the above definition, we required $\alpha$ to be an atomic
formula containing all free variables of the negated formula
$\phi$. Occasionally, it is convenient to allow a slightly
more liberal syntax. Let us say that $\alpha$ is a generalized
guard for $\phi$ if $\alpha$ is a disjunction of existentially
quantified atomic formulas such that the free variables of $\phi$
are included in the free variables of each disjunct. One could
extend GNFO by allowing generalized guards in the definition
of guarded negation, thus admitting formulas such as
$(\exists uv\, R(x,y,u,v) \vee \exists uv\, R(y,x,u,v)) \wedge \neg Sxy$. This would
not increase the expressive power of GNFO: if a negation
is guarded by a generalized guard, we can "pull out" the
disjunction and the existential quantification to obtain an
equivalent formula without generalized guards (at the cost
of a possibly exponential blow-up in formula size). In
particular, the above example can be equivalently expressed by
$\exists uv\,(R(x,y,u,v) \wedge \neg Sxy) \vee \exists uv\,(R(y,x,u,v) \wedge \neg Sxy)$.
Therefore, for simplicity, our definition of GNFO does not allow
for generalized guards.

GNFP. Guarded Negation Fixed Point Logic (GNFP) further
extends GNFO with an operator for least fixed points of
positively definable monotone operations on relations. That
is, we introduce second-order variables (also called
fixed-point variables) of arbitrary arity, which may be used to
form atomic formulas in the same way as ordinary relation
symbols, and if $\phi$ is any GNFP formula, $X$ an $n$-ary
second-order variable ($n \geq 1$) occurring only positively in
$\phi$ (i.e., under an even number of negations),
$\mathbf{x} = x_1,\ldots,x_n$ a sequence of first-order variables,
$\mathbf{t} = t_1,\ldots,t_n$ a sequence of terms (first-order variables
or constant symbols), and the free first-order variables of
$\phi$ are included in $\mathbf{x}$, then

$[\mathrm{LFP}_{X,\mathbf{x}}\ \alpha \wedge \phi](\mathbf{t})$

is also a formula of GNFP, where $\alpha$ is a generalized guard
for $\phi$, i.e., a disjunction of existentially quantified atomic
formulas (involving only atomic relations, no second-order
variables), such that all free first-order variables of $\phi$ are
also free variables of each disjunct of $\alpha$.

In the above formula, the LFP operator is a generalized
quantifier binding the variables $X$ and $\mathbf{x}$. The formula
expresses that the tuple $\mathbf{t}$ belongs to the least fixed point
of the monotone operation on relations defined by
$\alpha \wedge \phi$. Incidentally, here, unlike in the case of GNFO,
it is important that $\alpha$ is allowed to be a generalized guard.

In what follows, whenever we consider LFP formulas, we
will always assume that they do not have any free
second-order variables. The formal semantics of
$[\mathrm{LFP}_{X,\mathbf{x}}\ \alpha \wedge \phi](\mathbf{t})$ is the familiar one from
least fixed-point logic, cf. [1]. If the formula has at most one
free variable $x$, we may omit the guard $\alpha$, which can be
assumed to be the equality statement $x = x$. For example, the
GNFP formula

$[\mathrm{LFP}_{X,x}\ P(x) \vee \exists y\,(R(x,y) \wedge X(y))](z)$

says that there is an $R$-path from $z$ to some element in $P$.
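
As a minimal sketch of how the least fixed point in this example can be computed on a finite structure, the following Python fragment iterates the operator that maps $X$ to the set of $x$ satisfying $P(x) \vee \exists y\,(R(x,y) \wedge X(y))$, starting from the empty set, until nothing changes. The function names `lfp` and `reaches_p` and the sample relations are illustrative assumptions and are not part of the logic itself.

```python
def lfp(step):
    """Iterate a monotone operator from the empty set until it stabilizes."""
    current = frozenset()
    while True:
        nxt = frozenset(step(current))
        if nxt == current:
            return current
        current = nxt


def reaches_p(P, R):
    """Elements z with an R-path (possibly of length 0) to some element of P."""
    def step(X):
        # P(x) or exists y (R(x, y) and X(y))
        return set(P) | {x for (x, y) in R if y in X}
    return lfp(step)


# Hypothetical sample structure: a chain 1 -> 2 -> 3 and a separate edge 4 -> 5.
P = {3}
R = {(1, 2), (2, 3), (4, 5)}
result = reaches_p(P, R)
```

On a finite structure the iteration terminates because the operator is monotone and each round can only add elements.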



Definability of Greatest Fixed Points. Besides the above
least fixed-point operator, we can consider an analogous
operator GFP for taking the greatest fixed point of a definable
monotone operation on relations. However, as it turns out, it
is possible to define the GFP operator in terms of the LFP
operator (and vice versa) using a dualization via guarded
negation. Specifically,
$[\mathrm{GFP}_{X,\mathbf{x}}\ \alpha(\mathbf{x}) \wedge \phi(\mathbf{x})](\mathbf{t})$ can be
equivalently expressed as
$\alpha(\mathbf{t}) \wedge \neg[\mathrm{LFP}_{X,\mathbf{x}}\ \alpha(\mathbf{x}) \wedge \neg\phi'(\mathbf{x})](\mathbf{t})$,
where $\phi'$ is obtained from $\phi$ by replacing all subformulas
of the form $X(\mathbf{t}')$ by $\alpha(\mathbf{t}') \wedge \neg X(\mathbf{t}')$. For
this reason, the above definition
of GNFP does not include GFP as a primitive operator.
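
The dualization can be checked on a small finite structure. The sketch below is only an illustration under stated assumptions: the guard $\alpha(x)$ is taken to be a unary relation `V` that holds of every element, $\phi(X)$ is taken to be $\exists y\,(R(x,y) \wedge X(y))$, and the helper names `lfp`, `gfp` and `phi` are hypothetical. It computes the greatest fixed point directly and via the guarded complement of a least fixed point, and the two results can then be compared.

```python
def lfp(step):
    """Iterate a monotone operator upward from the empty set."""
    current = frozenset()
    while True:
        nxt = frozenset(step(current))
        if nxt == current:
            return current
        current = nxt


def gfp(step, top):
    """Iterate a monotone operator downward from `top`."""
    current = frozenset(top)
    while True:
        nxt = frozenset(step(current))
        if nxt == current:
            return current
        current = nxt


# Hypothetical structure: 1 -> 2 -> 1 is a cycle, 3 -> 1 leads into it,
# and 4 -> 5 is a dead end.
V = frozenset({1, 2, 3, 4, 5})  # plays the role of the guard alpha(x)
R = {(1, 2), (2, 1), (3, 1), (4, 5)}


def phi(X):
    # phi(X) = { x | exists y (R(x, y) and X(y)) }
    return {x for (x, y) in R if y in X}


# Direct computation of [GFP_{X,x} alpha(x) and phi(x)].
direct = gfp(lambda X: V & phi(X), V)

# Dualized computation: alpha and not [LFP_{X,x} alpha(x) and not phi'(x)],
# where phi' replaces each X(t') by alpha(t') and not X(t').
dual = V - lfp(lambda X: V - phi(V - X))
```

In this example both computations yield the elements from which an infinite $R$-path starts.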

Definability of Simultaneous Fixed Points. It is com- 
mon, in the literature on fixed point logics, to consider also 
a simultaneous least fixed point operator, that takes as ar- 
guments not a single formula but a tuple of formulas. More 
precisely, in the context of GNFP it is natural to consider
also formulas of the form $[\mathrm{LFP}_{X_i}\ S](\mathbf{t})$ where

$S = \left\{ \begin{array}{l}
X_1(\mathbf{x}_1) \leftarrow \alpha_1(\mathbf{x}_1) \wedge \phi_1(X_1,\ldots,X_n,\mathbf{x}_1) \\
\quad\vdots \\
X_n(\mathbf{x}_n) \leftarrow \alpha_n(\mathbf{x}_n) \wedge \phi_n(X_1,\ldots,X_n,\mathbf{x}_n)
\end{array} \right.$

is a system of GNFP formulas, with each $X_k$ a distinct
second-order variable, whose arity matches the length of the
tuple $\mathbf{x}_k$, and which occurs only positively in
$\phi_1,\ldots,\phi_n$, and where $\mathbf{t}$ is a tuple of terms of the
same length as $\mathbf{x}_i$. Here, the system $S$ can be viewed as
defining a monotone operation on tuples of relations, and the
above formula expresses that $\mathbf{t}$ belongs to the $i$-th
component of the least fixed point
of this operation. It is well known that simultaneous fixed 
point expressions of this form can be expressed equivalently 
using a nesting of ordinary, single-variable, fixed point op- 
erators, possibly at the cost of an exponential blow-up in 
formula size (cf. for example [2]). Hence, extending GNFP 
with such a simultaneous least fixed point operator does not 
increase its expressive power. 

Disjunctive Normal Form for GNFO and Width. We
say that a GNFO formula is in Disjunctive Normal Form
(DNF) if it is a disjunction of disjunction-free GNFO
formulas, no existential quantifier occurs directly below a
conjunction sign, and no conjunction sign occurs directly below
a negation sign. Equivalently, a GNFO formula is in DNF 
if it is a disjunction of GNFO formulas generated by the 
following recursive definition: 
where, in the last clause, $\alpha$ is an atomic formula containing
all free variables of $\phi$. Every GNFO formula is equivalent
to one in DNF, of possibly exponential size, that can be 
obtained by repeatedly applying the following equivalences. 

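$(\exists x\,\phi) \wedge \psi \equiv \exists x'(\phi[x'/x] \wedge \psi)$,   $\phi \wedge (\psi \vee \chi) \equiv (\phi \wedge \psi) \vee (\phi \wedge \chi)$
$\exists x(\phi \vee \psi) \equiv \exists x\,\phi \vee \exists x\,\psi$,   $\alpha \wedge \neg(\phi \wedge \psi) \equiv (\alpha \wedge \neg\phi) \vee (\alpha \wedge \neg\psi)$

The width of a GNFO formula $\phi$ is the number of variables
occurring (free or bound) in the DNF formula obtained from
$\phi$ by applying the above rules.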

A union of conjunctive queries (UCQ) is a GNFO formula 
in DNF without negation. Thus, GNFO can be naturally 
viewed as an extension of UCQs with guarded negation. 



Known Decidability and Complexity Results. The
following theorem summarizes what is known about GNFO
and GNFP that is relevant for present purposes. Recall
that the satisfiability problem has as input a formula
$\phi(\mathbf{x})$, and asks whether there exists a structure $M$ and
a tuple of elements $\mathbf{a}$ such that $M \models \phi(\mathbf{a})$. The
entailment problem takes as input two formulas $\phi(\mathbf{x})$,
$\psi(\mathbf{x})$, and asks whether it is the case that, for every
structure $M$ and for every tuple of elements $\mathbf{a}$,
$M \models \phi(\mathbf{a})$ implies $M \models \psi(\mathbf{a})$. The model
checking problem has as input a formula $\phi(\mathbf{x})$, a structure
$M$, and a tuple of elements $\mathbf{a}$, and asks whether
$M \models \phi(\mathbf{a})$.

Theorem 2.1 ([7]) 1. The satisfiability problem and the
entailment problem for GNFO and for GNFP are
decidable and 2ExpTime-complete. This holds both for
finite structures and for unrestricted structures.

2. For GNFO formulas, satisfiability over finite structures
coincides with satisfiability over unrestricted
structures, and similarly for entailment. The same does
not hold for GNFP.

3. The model checking problem for GNFO is
$P^{NP[\log^2]}$-complete (combined complexity). For GNFP, the
problem is $P^{NP}$-hard and is contained in
$NP^{NP} \cap coNP^{NP}$.

In the above theorem, $P^{NP[\log^2]}$ refers to those problems
that can be solved by a polynomial time deterministic
algorithm that is allowed to ask $O(\log^2(n))$ queries to an
NP-oracle, cf. Section 7.1. A close analysis of the 2ExpTime
upper bound argument for the satisfiability and entailment
problems of GNFP shows that these results extend to the
case with simultaneous fixed-point operators (both on finite
structures and on unrestricted structures).

2.1 Guarded Negation in Relational Algebra 

The concept of guarded negation can be equivalently cast 
in terms of the Relational Algebra, where negation is ex- 
pressed by means of the difference operator. Consider the 
Relational Algebra (RA) defined over a schema consisting of 
relation symbols of specified arity using the following
primitive operators (cf. [1] for their semantics).

Atomic Relations: every relation symbol belongs to RA.

Selection: if $E \in$ RA has arity $k$ and $1 \leq i,j \leq k$, then
$\sigma_{i=j}(E)$ belongs to RA and has arity $k$.

Projection: if $E \in$ RA has arity $k$ and
$1 \leq i_1,\ldots,i_n \leq k$, then $\pi_{i_1 \ldots i_n}(E)$ belongs to
RA and has arity $n$.

Crossproduct: if $E_1, E_2 \in$ RA have arity $k$ and $n$,
respectively, then $E_1 \times E_2$ belongs to RA and has arity
$k+n$.

Union, Intersection, and Difference: if $E_1, E_2 \in$ RA
both have arity $k$, then $E_1 \cup E_2$, $E_1 \cap E_2$ and
$E_1 - E_2$ belong to RA and have arity $k$.

Specifically, the proof of the 2ExpTime upper bound for GNFP
is based on a satisfiability-preserving translation from GNFP
to guarded fixed-point logic (GFP). The translation may give
rise to an exponential blow-up in the size of the formula, but
it preserves the width (following a suitable definition of width,
analogous to the definition of width for GNFO formulas). The
satisfiability problem for GFP formulas, in turn, is decidable in
time $2^{\mathrm{poly}(|\phi|) \cdot \exp(\mathrm{width}(\phi))}$ (where
$\mathrm{poly}(n)$ is short for $n^{O(1)}$ and $\exp(n)$ is short for
$2^{\mathrm{poly}(n)}$) by a reduction to the emptiness problem for a
suitable type of automata [23, 6]. The translation from GFP
formulas to automata extends immediately to the case of
formulas containing simultaneous fixed-point operators.
Furthermore, the polynomial-time inductive
satisfiability-preserving translation from GNFP to GFP given in
[7] (which in fact simply commutes with the fixed-point
operators) extends in a straightforward manner to the case
where the input and output formulas may contain simultaneous
fixed-point operators.

query := select (t_1 as ATTR_1, ..., t_n as ATTR_n) from (REL_1 R_1, ..., REL_m R_m) where condition
       | query union query | query intersect query | query except query

condition := true | t_1 = t_2 | t in query | exists(query)
           | condition and condition | condition or condition | not(condition)

Figure 1: Grammar for FO-SQL queries

Codd's completeness theorem states that RA has the same
expressive power as the domain-independent fragment of
first-order logic, cf. [1]. Let us briefly recall here the
definition of domain independence for first-order formulas
without constant symbols [1]. The active domain of a structure
$M$ is the set $adom(M)$ of all elements that occur in a
tuple belonging to one of the relations. For any structure $M$,
let $M'$ be a copy of the same structure but where all
elements outside $adom(M)$ are removed. A first-order formula
$\phi(\mathbf{x})$ without constant symbols is domain-independent if
(i) whenever $M \models \phi(\mathbf{a})$, then the tuple $\mathbf{a}$
consists of elements of $adom(M)$, and (ii) for all tuples
$\mathbf{a}$ consisting of elements of $adom(M)$,
$M \models \phi(\mathbf{a})$ if and only if $M' \models \phi(\mathbf{a})$. The
same definition applies to formulas with fixed-point
operators. Examples of first-order formulas that are not
domain-independent are $P(x) \vee Q(y)$, $x = x$, and $\neg P(x)$.

We say that a relational algebra expression is
negation-guarded if every occurrence of the difference operator
is of the form $\pi_{i_1 \ldots i_n}(R) - E$ where $R$ is a relation
symbol. We denote by GN-RA the negation-guarded fragment of
RA. It can be shown by straightforward inductive translations
that GN-RA captures GNFO in the following sense.



Theorem 2.2 Every $k$-ary GN-RA expression is equivalent
to a domain-independent GNFO formula $\phi(x_1,\ldots,x_k)$, and
vice versa, via a linear translation from GN-RA to GNFO
and an exponential translation backwards.

Let $R$, $S$ be relation symbols of arity 2 and 1, respectively.
The following RA expressions are not negation-guarded.

$(\pi_1(R) \times S) - \pi_{1,1}(R)$   (distinct pairs from $\pi_1(R) \times S$)
$\pi_{1,4}(\sigma_{2=3}(R \times R)) - R$   (reachability in two steps, not one)
$\pi_1(R) - \pi_1((\pi_1(R) \times S) - R)$   (the quotient $R \div S$)

In fact, it follows from results in [7] that none of these
expressions is equivalent to a GN-RA expression.
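
The syntactic condition defining GN-RA is easy to check mechanically. The following sketch assumes a hypothetical encoding of RA expressions as nested Python tuples (`("rel", name)`, `("select", i, j, E)`, `("project", indices, E)`, `("product", E1, E2)`, `("union", E1, E2)`, `("intersect", E1, E2)`, `("diff", E1, E2)`); neither this encoding nor the function name is taken from the paper. Note that the check is purely syntactic: it does not decide whether an expression is equivalent to some GN-RA expression.

```python
def is_negation_guarded(expr):
    """Check that every difference has the form project(indices, rel) - E."""
    op = expr[0]
    if op == "rel":
        return True
    if op in ("select", "project"):
        # The argument expression is the last component of the tuple.
        return is_negation_guarded(expr[-1])
    if op in ("product", "union", "intersect"):
        return is_negation_guarded(expr[1]) and is_negation_guarded(expr[2])
    if op == "diff":
        left, right = expr[1], expr[2]
        # The left operand must be a projection of a relation symbol.
        guarded = left[0] == "project" and left[2][0] == "rel"
        return guarded and is_negation_guarded(right)
    raise ValueError(f"unknown operator: {op}")


R = ("rel", "R")  # binary
S = ("rel", "S")  # unary

# The three expressions above, none of which is negation-guarded.
e1 = ("diff", ("product", ("project", (1,), R), S), ("project", (1, 1), R))
e2 = ("diff", ("project", (1, 4), ("select", 2, 3, ("product", R, R))), R)
e3 = ("diff", ("project", (1,), R),
      ("project", (1,), ("diff", ("product", ("project", (1,), R), S), R)))

# A negation-guarded expression: first components of R-tuples
# that are not the first component of a tuple (x, x) in R.
e4 = ("diff", ("project", (1,), R), ("project", (1,), ("select", 1, 2, R)))

checks = [is_negation_guarded(e) for e in (e1, e2, e3, e4)]
```

In `e3` the outer difference is guarded, but the inner one is not, which is why the whole expression is rejected.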

Observe that in the above definition of GN-RA we did
not allow for the use of constant values in selections and
projections. This was only to simplify presentation. All
complexity results that we will present go through in the
presence of constant values, cf. Section 9.

3. GUARDED NEGATION IN SQL 

In this section, we discuss what it means for an SQL query 
to have guarded negation. More precisely, we consider a sim- 
ple, first-order expressively complete, fragment of SQL with 
a set-based semantics, that we call FO-SQL, and we char- 
acterize the queries in this fragment that can be expressed 
in GNFO. 



FO-SQL: a Simple First-Order Fragment of SQL. In
this section, unlike in the rest of the paper, we work with
named schemas. A named schema is a collection of relation
names, each with an associated list of attribute names.
For the discussion below, assume we have a fixed schema,
say, consisting of BOOK(ISBN, AUTHOR, TITLE) and
LOCATION(ISBN, SHELF, NUMBER). We also fix an infinite supply
of "tuple variables" (also known as aliases, and denoted by
$R_1, R_2, \ldots$). By a term $t$ we will mean an expression of the
form $R_i$.ATTR where $R_i$ is a tuple variable and ATTR is an
attribute name.

We consider SQL expressions that are generated by the
simple grammar given in Figure 1, where each $t_i$ is a term,
each REL$_i$ is a relation name, ATTR$_1$, ..., ATTR$_n$ are
distinct attribute names, and $R_1, \ldots, R_m$ are distinct tuple
variables. This grammar generates queries that may have
free tuple variables, i.e., there may be occurrences of tuple
variables $R_i$ that are not in the scope of a select-from-where
clause where they are declared. We will refer to queries
with free tuple variables as open queries (or correlated
subqueries), and we refer to queries without free tuple
variables as closed queries (or uncorrelated subqueries). We
will denote by $FV(q)$ the set of free tuple variables of $q$. We
will be mainly interested in closed queries. Note, however, that
closed queries are allowed to contain subexpressions of the
form exists($q$) or of the form $t$ in $q$ where $q$ is an open query.

We only consider queries that are well-typed in the sense 
that each (open or closed) query can be consistently assigned 
a (unique) type, which is a list of attribute names, where 

1. the type of a select-from-where query is the set of at- 
tribute names specified in its select clause; 

2. the union, intersect, and except operators take as ar- 
guments two queries of equal type, yielding a query of 
the same type. 

Furthermore, terms $R_i$.ATTR are only allowed to occur when
ATTR belongs to the schema of the relation to which the
occurrence of $R_i$ in question is bound, and conditions of the
form $t$ in $q$ are allowed only when $q$ is a unary query, i.e.,
when the type of $q$ consists of a single attribute.

By an FO-SQL query, we will mean a closed query
satisfying the above requirements. Two examples are given in
Figure 2. We assume that the reader is familiar with the
semantics of SQL, and hence omit the formal semantics of
the fragment FO-SQL. We just mention that we disregard
order and duplicates, treating relations as sets of tuples.
It is known that, under this set-based semantics, FO-SQL
is expressively complete for first-order logic, in the sense
of Codd's expressive completeness theorem [1, 29]. That
is, FO-SQL queries have the same expressive power as the 
domain-independent fragment of first-order logic. Since FO- 
SQL queries are defined in terms of named schemas, while 
the syntax of first-order logic is based on unnamed schemas 
in which the attributes of a relation are identified by natural 
numbers instead of by attribute names, here, we consider a
FO-SQL query $q$ of type $\{A_1, \ldots, A_n\}$ to be "equivalent" to
a first-order formula $\phi(x_1, \ldots, x_n)$, containing the relation
names REL$_i$ occurring in $q$ as relation symbols of appropriate
arity, if for every instance $I$ and for every $n$-tuple
$\mathbf{a}$, the tuple $\mathbf{a}$ is an answer to $q$ in $I$ if and
only if $I \models \phi(\mathbf{a})$.

select A.NAME from AUTHOR A where not exists(
select B.TITLE from BOOK B where B.AUTH = A.NAME)

select A.NAME from AUTHOR A where not exists(
select B.TITLE from BOOK B where not B.AUTH = A.NAME)

Figure 2: Two examples of FO-SQL queries, the first
negation-guarded and the second not.

The most important features of (full) SQL that are ex- 
cluded in the above definition of FO-SQL are constants, 
arithmetical comparison, and aggregation. We will discuss 
the importance of these restrictions later in Section 9.

GN-SQL: the Guarded-Negation Fragment of FO-SQL. 

We say that an SQL query is negation-guarded if the follow- 
ing two conditions hold: 

1. each except operator has as its first argument a simple 
projection and as its second argument an uncorrelated 
subquery. 

2. each not operator has as its argument a condition with 
at most one free tuple variable. 

Here, by a simple projection, we mean a select-from-where 
query, where the where-clause is 'true'. GN-SQL is the frag- 
ment of FO-SQL consisting of all negation-guarded queries. 

To illustrate this definition, consider the two queries given
in Figure 2. The first query involves a single occurrence of
negation, which is guarded, since the negated condition has
only one free tuple variable, namely A. The second query,
on the other hand, is not a GN-SQL query, since the second
occurrence of negation is not guarded. Indeed, the condition
B.AUTH = A.NAME has two free tuple variables.

The next theorem states that GN-SQL captures GNFO, in 
the same way that FO-SQL captures full first-order logic, as 
we discussed above (the same conventions apply, concerning 
what it means for a FO-SQL query to be equivalent to a 
first-order formula). 

Theorem 3.1 (GN-SQL is Codd-complete for GNFO) 

Each GN-SQL query can be translated in linear time into an 
equivalent domain-independent GNFO formula. Conversely,
each domain-independent GNFO formula can be translated 
in exponential time into an equivalent GN-SQL query. 

It can be shown that the exponential complexity of the
translation from GNFO to GN-SQL is in general unavoidable
for formulas of the form
$(R(x_1) \vee S(x_1)) \wedge \cdots \wedge (R(x_n) \vee S(x_n))$.
On the other hand, the proof of Theorem 3.1 shows
that if the schema includes a unary relation ADOM that is
guaranteed to denote the active domain of the instance, then 
there is a polynomial translation. 

3.1 Negation in Practice: a Benchmark Study 

In order to assess the usage of negation in SQL queries in
practice, we have studied the workloads of two standard SQL
benchmarks, namely TPC-H [37] and TPC-DS [36]. These
benchmarks were designed to evaluate and compare the
performance of relational database management systems. In
addition, we studied the sample queries published on the
Sloan Digital Sky Survey (SDSS) SkyServer website [35], a
selection of actual queries submitted by SDSS users. For
each query, we investigated whether the query uses negation,
and, if so, whether the query is negation-guarded. We
also studied the use of inequalities, and investigated which of
these inequalities can be expressed using guarded negation.
The results, given in Figure 3, show that most queries using
negation use only guarded negation. We should note here
that most queries contain SQL constructs, such as aggregation,
that do not belong to FO-SQL. Therefore, the queries
are not necessarily expressible in GN-SQL. The statistics in
Figure 3 are concerned only with the explicit use
of negation. We also did not investigate the use of other
SQL constructs, such as outer joins, that can, in some sense,
be viewed as involving an implicit form of negation.

Figure 3: Usage of negation in SQL benchmarks
By negation, we mean any occurrence of not or except.
An inequality is any occurrence of <> or !=. An inequality is
guarded if the corresponding negation not(... = ...) is guarded.

4. GUARDED NEGATION IN DATALOG 

In this section, we present a powerful variant of Data- 
log with stratified negation, which we call GN-Datalog and 
which, in terms of its expressive power, is contained in 
GNFP. We first briefly recall the syntax and semantics of 
Datalog, with and without stratified negation. 

Definition 4.1 (Datalog) A Datalog program is specified
by a triple $\Pi = (\mathrm{EDB}^{\Pi}, \mathrm{IDB}^{\Pi}, \mathrm{Rules}^{\Pi})$, where
$\mathrm{EDB}^{\Pi}$ and $\mathrm{IDB}^{\Pi}$ are disjoint sets of relation names,
each with an associated arity, and $\mathrm{Rules}^{\Pi}$ is a finite set
of rules of the form

$\phi \leftarrow \psi_1, \ldots, \psi_n$

where $\phi, \psi_1, \ldots, \psi_n$ are atomic formulas of the form
$R(x_1, \ldots, x_n)$ with $R \in \mathrm{EDB}^{\Pi} \cup \mathrm{IDB}^{\Pi}$ and
$x_1, \ldots, x_n$ a sequence of first-order variables of
appropriate length. We refer to $\phi$ as the head of the rule,
and $\psi_1, \ldots, \psi_n$ as the body of the rule. In addition, we
require that (i) every first-order variable occurring in the
head of a rule must occur in the body, and (ii) the relation
in the head of each rule must be an IDB relation.

A Datalog query is a pair $(\Pi, Ans)$, where $\Pi$ is a Datalog
program and $Ans$ is a union of conjunctive queries over the
schema $\mathrm{EDB}^{\Pi} \cup \mathrm{IDB}^{\Pi}$. The semantics of a Datalog
query is defined as follows: first, if $\Pi$ is a Datalog
program, $I$ an instance over the schema $\mathrm{EDB}^{\Pi}$, and $k$ a
natural number, then we denote by $\Pi^k(I)$ the instance over
the schema $\mathrm{EDB}^{\Pi} \cup \mathrm{IDB}^{\Pi}$ containing all facts that
can be derived from the facts in $I$ using at most $k$ rounds of
applications of rules of $\Pi$. In addition, we denote by
$\Pi^\infty(I)$ the union $\bigcup_k \Pi^k(I)$.
If $q = (\Pi, Ans)$ is a Datalog query and $I$ an instance over
the schema $\mathrm{EDB}^{\Pi}$, then we denote by $q(I)$ the set of all
tuples that are an answer to the query $Ans$ in $\Pi^\infty(I)$.
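
To make the round-based semantics concrete, here is a minimal sketch of naive evaluation for negation-free Datalog. The encoding is an assumption made for illustration only: an atom is a pair `(relation_name, variables)`, a rule is a pair `(head, body)`, and an instance is a dictionary from relation names to sets of tuples. The function `apply_rules` performs one round, and `evaluate` iterates rounds until no new facts appear, which corresponds to computing $\Pi^\infty(I)$ on a finite instance. The answer query $Ans$ is not modelled.

```python
def match(atom, fact, binding):
    """Extend `binding` so that `atom` maps onto `fact`, or return None."""
    _, variables = atom
    new = dict(binding)
    for var, value in zip(variables, fact):
        if new.setdefault(var, value) != value:
            return None
    return new


def apply_rules(rules, instance):
    """One round: add every fact derivable by a single rule application."""
    derived = {rel: set(facts) for rel, facts in instance.items()}
    for head, body in rules:
        bindings = [{}]
        for atom in body:
            bindings = [b2 for b in bindings
                        for fact in instance.get(atom[0], ())
                        if (b2 := match(atom, fact, b)) is not None]
        rel, variables = head
        for b in bindings:
            derived.setdefault(rel, set()).add(tuple(b[v] for v in variables))
    return derived


def evaluate(rules, instance):
    """Iterate rounds until the instance no longer changes."""
    current = {rel: set(facts) for rel, facts in instance.items()}
    while True:
        nxt = apply_rules(rules, current)
        if nxt == current:
            return current
        current = nxt


# Hypothetical program: T is the transitive closure of the EDB relation E.
rules = [
    (("T", ("x", "y")), [("E", ("x", "y"))]),
    (("T", ("x", "z")), [("E", ("x", "y")), ("T", ("y", "z"))]),
]
I = {"E": {(1, 2), (2, 3), (3, 4)}}
result = evaluate(rules, I)
```

On this input the fact T(1, 4) first appears in the third round, so $\Pi^2(I)$ and $\Pi^3(I)$ differ while $\Pi^3(I) = \Pi^\infty(I)$.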

We remark that the above definition differs slightly from 
the standard presentation of Datalog. Usually, Ans is
required to be a designated relation from $\mathrm{IDB}^{\Pi}$ instead of a
union of conjunctive queries. The presentation we use here 
is convenient as it helps simplify the definitions below. On 
the other hand, note that this is not essential: a Datalog 
program can always be extended with an additional IDB re- 
lation and with additional rules computing the Ans query. 

Definition 4.2 (Datalog with Stratified Negation)

A Datalog$^\neg$ program is a Datalog program $\Pi$ where the
body of each rule may, in addition, contain atomic formulas
of the form $\neg R(x_1, \ldots, x_n)$ provided that
$R \in \mathrm{EDB}^{\Pi}$, and provided that each first-order variable
occurring in the head or body of the rule occurs positively in
the body. A Datalog program with stratified negation
is a sequence $\Pi = (\Pi_1, \ldots, \Pi_n)$ of Datalog$^\neg$
programs, called strata, with $n \geq 1$, where for each
$i = 2, \ldots, n$,
$\mathrm{EDB}^{\Pi_i} = \mathrm{EDB}^{\Pi_{i-1}} \cup \mathrm{IDB}^{\Pi_{i-1}}$. We use
$\mathrm{EDB}^{\Pi}$ and $\mathrm{IDB}^{\Pi}$ to denote $\mathrm{EDB}^{\Pi_1}$ and
$\bigcup_{i=1,\ldots,n} \mathrm{IDB}^{\Pi_i}$, respectively.

A Datalog query with stratified negation is a pair $(\Pi, Ans)$,
where $\Pi = (\Pi_1, \ldots, \Pi_n)$ is a Datalog program with
stratified negation and $Ans$ is a union of conjunctive queries
over the schema $\mathrm{EDB}^{\Pi} \cup \mathrm{IDB}^{\Pi}$. The semantics of
Datalog programs and of Datalog queries extends naturally
to Datalog with stratified negation, by defining $\Pi^\infty(I)$ as

$\Pi_n^\infty(\Pi_{n-1}^\infty(\cdots \Pi_1^\infty(I) \cdots))$ for $\Pi = (\Pi_1, \ldots, \Pi_n)$.

We say that a Datalog program $\Pi$ is non-recursive if no
IDB occurs in the body of any of its rules, and hence, in
particular, for all instances $I$ we have that $\Pi^{\infty}(I) = \Pi^{1}(I)$.
We say that a Datalog program with stratified negation is 
non-recursive if it consists entirely of non-recursive strata. 

Definition 4.3 (GN-Datalog) A GN-Datalog program
is a Datalog program with stratified negation
$\Pi = (\Pi_1, \ldots, \Pi_n)$, where each rule

$\phi_0 \leftarrow (\neg)\phi_1, \ldots, (\neg)\phi_n \in \mathrm{Rules}^{\Pi_k}$ $(1 \leq k \leq n)$

is negation guarded, meaning that the following holds:

For each atom $\phi_i$ that either occurs negated in the
body or is the head, the body includes a positive atom
$\phi_j$ containing all first-order variables occurring in $\phi_i$,
and $\phi_j$ uses a relation from $\mathrm{EDB}^{\Pi_k}$.

A GN-Datalog query is a Datalog query with stratified nega- 
tion, where each rule is negation guarded. Note that this 
requirement concerns only the rules; the answer query Ans 
can be any union of conjunctive queries. 
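The guardedness condition of Definition 4.3 is purely syntactic and can be checked rule by rule. The following is a minimal sketch under assumed conventions that are not part of the paper: atoms are `(relation, terms)` pairs, upper-case strings are variables, a body is a list of `(positive, atom)` pairs, and `edb` is the set of relation names that count as EDB for the stratum at hand (which, by Definition 4.2, includes the IDB relations of lower strata).

```python
def variables(atom):
    """Variables of an atom; upper-case strings are treated as variables."""
    _, terms = atom
    return {t for t in terms if t.isupper()}


def is_negation_guarded(head, body, edb):
    """Check the condition of Definition 4.3 for a single rule.

    head: an atom; body: list of (positive, atom) pairs;
    edb: relation names that are EDB for the stratum of this rule.
    """
    # The head and every negated body atom must be guarded.
    to_guard = [head] + [atom for positive, atom in body if not positive]
    # Only positive body atoms over EDB relations may serve as guards.
    guards = [atom for positive, atom in body if positive and atom[0] in edb]
    return all(
        any(variables(atom) <= variables(guard) for guard in guards)
        for atom in to_guard
    )
```

For instance, the rule Q(X) <- R(X,Y), not S(X,Y) is negation guarded when R is EDB, whereas Q(X,Z) <- R(X,Y), R(Y,Z) is not, because no single body atom contains both head variables.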

Theorem 4.4 (Non-recursive GN-Datalog is Codd- 
complete for GNFO) Each non-recursive GN-Datalog 
query is equivalent to a domain-independent GNFO formula, 
and vice versa, via exponential translations. 

To understand why this is the appropriate definition of negation
guardedness, observe that a rule of the form $\phi \leftarrow \psi_1, \ldots, \psi_n$
expresses that $\neg\exists \mathbf{x}\,(\psi_1 \wedge \cdots \wedge \psi_n \wedge \neg\phi)$, i.e., the head of the rule
plays the same role as a negated atom in the body.



The translation from non-recursive GN-Datalog to GNFO
given in the proof of Theorem 4.4 can be extended in a
straightforward manner to a translation from GN-Datalog
to the extension of GNFP with simultaneous fixed-point
operators. Since simultaneous fixed-point operators can be
eliminated (at the cost of an additional exponential blow-up,
cf. Section 2), we obtain the following:

Theorem 4.5 Each GN-Datalog query is equivalent to a 
domain-independent alternation-free GNFP formula. 

The translation from non-recursive GN-Datalog to GNFO
provided by Theorem 4.4 involves an exponential blow-up,
due to an elimination of subformula sharing. The translation
from GN-Datalog to GNFP provided by Theorem 4.5
involves another exponential blow-up, due to the elimination
of simultaneous fixed-point operators. These sources
of exponential complexity can be avoided (i) if we tran- 
scribe GN-Datalog queries into GNFP formulas over a larger 
schema (containing a relation symbol not only for each EDB 
of the GN-Datalog query, but also for each IDB), and (ii) 
freely use simultaneous fixed-point operators in the GNFP 
formula. More precisely, the proof of Theorem 4.5 can be
adapted in a straightforward manner to show the following
result, which will be useful later on (where, for two schemas
$S \subseteq S'$, an $S'$-expansion of an instance $I$ over $S$ is an instance
over $S'$ that agrees with $I$ on all facts over $S$).

Theorem 4.6 For every $k$-ary GN-Datalog query $q$ over a
schema $S$ one can compute in polynomial time a GNFP sentence
$\phi_q$ and a GNFP formula $\psi_q(x_1, \ldots, x_k)$, both with
simultaneous fixed point operators, and over a possibly larger
schema $S'$, such that

1. each instance $I$ has a unique $S'$-expansion $I'$ satisfying $\phi_q$.

2. for all instances $I$ and $k$-tuples $\mathbf{a}$, $\mathbf{a} \in q(I)$ iff $I' \models \psi_q(\mathbf{a})$.

5. RELATIONSHIPS WITH EXISTING 
LANGUAGES 

Monadic Datalog is a well-known Datalog fragment that
combines an interesting level of expressiveness with good
algorithmic behavior thanks to a tight connection with tree
automata, which also makes monadic Datalog suitable for
a number of applications, e.g. [21]. It also stands out as a
fragment for which the boundedness problem is decidable
[17], cf. Theorem 8.2 below. Monadic Datalog does not allow
any form of negation and since all IDB predicates are unary,
guardedness of rule heads is guaranteed, so that monadic
Datalog rules are trivially negation guarded. We will show
that boundedness remains decidable for GN-Datalog. 

Datalog LIT is a fragment of stratified Datalog whose 
model checking has linear-time data complexity [20]. Each
Datalog LIT rule must either contain in its body as 'guard' 
a positive literal containing all variables occurring in the 
rule, or must solely be comprised of unary literals (includ- 
ing its head). While the 'guard' of guarded rules need not be 
an EDB atom, [20] shows that every Datalog LIT program
can (in exponential time) be transformed into an equivalent 
one having only EDB atoms as guards. The latter are triv- 
ially negation guarded. Every Datalog LIT program is thus 
equivalent to a GN-Datalog program. 



GNFO subsumes a number of formalisms having currency
in ontological reasoning, such as the linear and guarded
tuple-generating dependencies (tgds) underlying the recently
promoted Datalog± framework [12] and the more general
frontier-guarded tgds [5] that subsume the description logics
DL-Lite_R (which captures RDF Schema), $\mathcal{EL}$, and $\mathcal{ELH}^{dr}_{\bot}$
[3], which is the core of the proposed OWL-EL profile of 
the OWL 2.0 ontology language. GNFO can encode query 
answering and containment assertions involving such speci- 
fications of constraints or TBoxes. A tgd is a sentence 



Theorem |6 . 1 1 generalizes the known decidability result for 

In 



Vx,y <?!'(x,y) -> 3z ip{y,z) 



(2) 



where both $\phi$ and $\psi$, called the body and the head,
respectively, of the tgd rule, are conjunctions of positive atoms.
When working under OWA one can assume, as a matter of
convenience, that the head of every tgd is a single atom.
A tgd is linear if $\phi$ consists of a single atom; it is guarded
if $\phi$ contains as conjunct an atom $R(\mathbf{x},\mathbf{y})$, the 'guard', in
which all of the variables of the body occur together; and
it is frontier-guarded if the body contains an atom $P(\mathbf{y})$ in
which all of the variables shared by the body and the head
of the rule occur together. Every frontier-guarded tgd
naturally translates to a GNFO sentence.
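The three syntactic classes of tgds just defined can be told apart mechanically. The sketch below is illustrative only; the encoding of atoms as `(relation, terms)` pairs with upper-case variables and the function name `classify_tgd` are assumptions of the example, and existentially quantified head variables are simply those head variables that do not occur in the body.

```python
def variables(atom):
    """Variables of an atom; upper-case strings are treated as variables."""
    _, terms = atom
    return {t for t in terms if t.isupper()}


def classify_tgd(body, head):
    """Classify a tgd given as lists of body atoms and head atoms."""
    body_vars = set().union(*map(variables, body))
    head_vars = set().union(*map(variables, head))
    frontier = body_vars & head_vars          # variables shared by body and head
    return {
        "linear": len(body) == 1,
        "guarded": any(body_vars <= variables(a) for a in body),
        "frontier_guarded": any(frontier <= variables(a) for a in body),
    }
```

For example, R(X,Y), S(Y,Z) -> exists W. T(Y,W) is frontier-guarded (the frontier is just Y) but neither guarded nor linear, while R(X,Y) -> exists Z. R(Y,Z) falls into all three classes.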

Other query languages that can be viewed as fragments of 
GNFO include the semi-join algebra [28], as well as Unary 
Conjunctive View Logic (UCV) and Core XPath, cf. [li]. 
In addition to generalizing known decidability results,
Theorem 6.1 easily implies the decidability of containment
of Datalog queries in unions of conjunctive queries [15].
This can be seen as follows. For each Datalog query $q$, let
$\hat{q}$ be the GN-Datalog query obtained from $q$ by guarding
each rule using an additional conjunct that is a fresh EDB
relation. Then for each UCQ $q'$ over the original schema,
we have that $q$ is contained in $q'$ if and only if $\hat{q}$ is contained
in $q'$. One direction follows directly from the fact that $\hat{q}$
is contained in $q$. For the other direction, note that every
counterexample $I$ to the containment of $q$ in $q'$ gives rise
to a counterexample $I'$ to the containment of $\hat{q}$ in $q'$. The
instance $I'$ in question extends $I$ by interpreting each new
EDB relation as the total relation containing all tuples over
the active domain of $I$.
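The guarding construction used in this argument can be sketched as follows. This is a hedged illustration rather than the paper's construction verbatim: rules are `(head, body)` pairs of atoms with upper-case variables, the guard names `G0`, `G1`, ... are hypothetical and assumed not to occur in the original schema, and each guard is interpreted as the total relation over the active domain.

```python
from itertools import product


def variables(atom):
    """Variables of an atom; upper-case strings are treated as variables."""
    _, terms = atom
    return {t for t in terms if t.isupper()}


def guard_rules(rules):
    """Add to each rule a fresh EDB atom containing all variables of the rule."""
    guarded = []
    for i, (head, body) in enumerate(rules):
        rule_vars = sorted(set().union(variables(head), *map(variables, body)))
        guard = (f"G{i}", tuple(rule_vars))   # assumed-fresh relation name
        guarded.append((head, [guard] + list(body)))
    return guarded


def total_guards(guarded_rules, instance):
    """Extend the instance by interpreting every guard as the total relation."""
    adom = sorted({c for _, args in instance for c in args})
    extended = set(instance)
    for _, body in guarded_rules:
        name, terms = body[0]                 # the guard is the first body atom
        extended |= {(name, tup) for tup in product(adom, repeat=len(terms))}
    return extended


# Transitive closure: T(X,Y) <- E(X,Y);  T(X,Z) <- E(X,Y), T(Y,Z)
RULES = [
    (("T", ("X", "Y")), [("E", ("X", "Y"))]),
    (("T", ("X", "Z")), [("E", ("X", "Y")), ("T", ("Y", "Z"))]),
]
GUARDED = guard_rules(RULES)
```

Every guarded rule is negation guarded by construction, since its guard contains all variables of the rule and in particular those of the head. The extended instance has size polynomial in the active domain for a fixed program, with the exponent given by the largest number of variables in a rule.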

As a direct consequence of the finite model property of
GNFO [7] and of Theorems 3.1 and 4.4, respectively, we
find that query containment is finitely controllable for
GN-SQL queries and for non-recursive GN-Datalog queries. By
this we mean that one query is contained in another on
finite instances if, and only if, the containment holds on
unrestricted instances (a finite model property).

Theorem 6.2 Satisfiability and containment are finitely 
controllable for GN-SQL and for non-recursive GN-Datalog. 



6. QUERY CONTAINMENT 

We now exploit the connection with GNFO and GNFP 
to show that query containment is decidable for GN-SQL 
and for GN-Datalog. Recall that a query $q$ is satisfiable if
there exists an instance $I$ such that the set of answers $q(I)$ is
non-empty, and that a query $q_1$ is contained in a query $q_2$ if,
for all instances $I$, $q_1(I) \subseteq q_2(I)$. The satisfiability problem
can be viewed as (the complement of) a special case of the
query containment problem, where the second query $q_2$ is
any fixed unsatisfiable query.

Theorem 6.1 Query containment is 2ExpTime-complete
for both GN-SQL queries and GN-Datalog queries. Hardness 
holds already for satisfiability of non-recursive GN-Datalog, 
and GN-SQL, over a fixed EDB schema. 

The 2ExpTime upper bounds for GN-SQL follow directly
from Theorem 3.1 and Theorem 2.1. The 2ExpTime upper
bounds for GN-Datalog do not follow directly from
Theorem 4.5 and Theorem 2.1, due to the exponential complexity
of the translation from GN-Datalog to GNFP involved.
However, they follow using Theorem 4.6: let $q_1, q_2$ be $k$-ary
GN-Datalog queries ($k \geq 0$), and let $\phi_1, \psi_1(x_1, \ldots, x_k)$ and
$\phi_2, \psi_2(x_1, \ldots, x_k)$ be the GNFP formulas with simultaneous
fixed point operators obtained by Theorem 4.6. We
may assume without loss of generality that the only relation
symbols that $\phi_1, \psi_1$ and $\phi_2, \psi_2$ have in common are the
relation symbols that appear in $q_1$ and $q_2$. It follows that
$q_1$ is contained in $q_2$ if and only if
$\phi_1 \wedge \psi_1(x_1, \ldots, x_k) \models \phi_2 \rightarrow \psi_2(x_1, \ldots, x_k)$.
This gives us the desired result, since,
as we explained in Section 2, the 2ExpTime upper bound
for GNFP entailment from Theorem 2.1 extends to the case
with simultaneous least-fixed point operators. The lower
bounds are obtained by adapting the proof of a 2ExpTime-
hardness result for a fragment of GNFO in [14].



7. QUERY ANSWERING 

7.1 (Closed- World) Query Evaluation 

Since GN-SQL and non-recursive GN-Datalog admit 
translations into first-order logic, the data complexity of 
query evaluation is in AC$^0$ for both query languages.
Similarly, since GN-Datalog is contained in the fixed-point logic
FO(LFP), the data complexity of query evaluation is in 
PTime. In fact, there is a GN-Datalog query (a monadic 
Datalog query) for which query evaluation is PTime-hard in 
terms of data complexity [20].

In what follows we consider the combined complexity 
of query evaluation. Datalog evaluation is known to be 
ExpTime-complete for combined complexity (implicit in
[39]). Monadic Datalog evaluation is known to be
NP-complete [21]. The "guarded fragment of Datalog" (every
rule contains an EDB atom containing all variables occur- 
ring in the rule) evaluation is in PTime [20]. Non-recursive 
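Recall that $\mathrm{P}^{\mathrm{NP}[O(\log^2 n)]}$ is the class of those problems that
can be solved by a polynomial time deterministic algorithm
that is allowed to ask $O(\log^2 n)$ queries to an NP-oracle.
(It relates to better known complexity classes this way:

$\mathrm{NP} \subseteq \mathrm{DP} \subseteq \mathrm{P}^{\mathrm{NP}[O(\log n)]} \subseteq \mathrm{P}^{\mathrm{NP}[O(\log^2 n)]} \subseteq \cdots \subseteq \mathrm{P}^{\mathrm{NP}[O(\log^i n)]} \subseteq \cdots \subseteq \mathrm{P}^{\mathrm{NP}} \subseteq \Sigma_2^p \subseteq \mathrm{PSPACE} \subseteq \mathrm{EXPTIME}$.)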



Theorem 7.1 The combined complexity of evaluating
GN-SQL queries is $\mathrm{P}^{\mathrm{NP}[O(\log^2 n)]}$-complete.

Proof. The upper bound follows directly from
Theorem 3.1 and Theorem 2.1. For the lower bound, observe
that the translation from GNFO to GN-SQL given in the
proof of Theorem 3.1 is polynomial in the presence of a
unary relation ADOM containing all elements in the active
domain. We may assume without loss of generality that our
input instance contains such a relation. Therefore, the lower
bound from Theorem 2.1 extends to GN-SQL as well. □

We show here that the same problem is $\mathrm{P}^{\mathrm{NP}}$-complete for
GN-Datalog. Recall that the best known upper bound on
the complexity of model checking GNFP is $\mathrm{NP}^{\mathrm{NP}} \cap \mathrm{coNP}^{\mathrm{NP}}$.

Theorem 7.2 The combined complexity of evaluating
GN-Datalog queries is $\mathrm{P}^{\mathrm{NP}}$-complete. Hardness holds already
for non-recursive GN-Datalog queries with only unary IDB 
predicates and nullary negation. 

7.2 Open- World Query Answering 

Open world (OWA) query answering is the following prob- 
lem: given a query q, an instance I , and a tuple of values a, 
decide whether it is the case that a belongs to the answers 
of q in every instance extending I with additional facts. An 
instance of open-world query answering $I \models_{\mathrm{OWA}} q(\mathbf{a})$ thus
asks for the unsatisfiability of $I \cup \{\neg q(\mathbf{a})\}$ in the usual
first-order semantics, when treating $I$ as a set of atomic facts with
its elements as constants. Open world semantics is the natural
choice when working with incomplete databases, in data
exchange settings, and in the context of ontological reason- 
ing. In each of these settings, open world query answering 
is an extensively researched problem. 

In this section, we investigate the data complexity of open- 
world query answering for queries with guarded negation. 
Formally, for each query $q$ we denote by $\mathrm{OWA}_q$ the problem,
given an instance $I$ and a tuple of values $\mathbf{a}$ from $adom(I)$,
to decide whether $I \models_{\mathrm{OWA}} q(\mathbf{a})$. More generally, for each
query $q$ and for each set of constraints $\Sigma$, we denote by
$\mathrm{OWA}_{q,\Sigma}$ the problem, given an instance $I$, to decide whether
$I, \Sigma \models_{\mathrm{OWA}} q$.

Note that, in the absence of constraints, for conjunctive
queries $q$, by monotonicity, the problem $I \models_{\mathrm{OWA}} q(\mathbf{a})$
coincides with $I \models q(\mathbf{a})$, and therefore $\mathrm{OWA}_q$ is in PTime (in
fact, in AC$^0$). For first-order queries $q$, on the other hand,
the problem $\mathrm{OWA}_q$ can be undecidable. We will show
below that the problem is decidable for first-order queries with
guarded negation. 

As constraints, we will consider tuple-generating
dependencies, cf. (2), and key constraints. As noted above, linear,
guarded and frontier-guarded tgds [5, 12] are expressible in
GNFO. With respect to OWA query answering, conjunctive
queries are known to be FO-rewritable relative to linear
tgds [12] and possess Datalog rewritings relative to
frontier-guarded tgds [5]. Accordingly, the data complexity of
open-world query answering for conjunctive queries against
linear tgds is in AC$^0$, in PTime for frontier-guarded tgds, and
PTime-complete already for guarded tgds.
We begin by observing that OWA query answering for
GNFO queries, as for many description logics [34, 13, 32],
has coNP data complexity. For an instance $I$, we denote by
$|I|$ the total number of facts of $I$, and, for two instances $I, J$,
we write $I \subseteq J$ if every fact of $I$ is a fact of $J$.

Proposition 7.3 Let $\phi(\mathbf{x})$ be a fixed GNFO formula. For
an instance $I$ and a tuple $\mathbf{a}$ of elements from $adom(I)$, if
there is an instance $J \models \phi(\mathbf{a})$ with $I \subseteq J$, then there is an
instance $J' \models \phi(\mathbf{a})$ with $I \subseteq J'$ and $|J'| = O(|I|)$.



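Proposition 7.3 tells us that, in solving the open-world
query answering problem for GNFO queries, it suffices to
consider only instances whose size is linear in the size of the
input instance. This gives us the following: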

Theorem 7.4 For each GNFO query q (in particular, for 
each GN-SQL query), OWAq is in coNP. There is a boolean 
GN-SQL query q for which OWAq is coNP-hard. 

Proof. The coNP upper bound is immediate from the
above proposition. Given an instance $I$ with distinguished
elements $\mathbf{a}$, Proposition 7.3 shows that to refute
$I \models_{\mathrm{OWA}} q(\mathbf{a})$ it suffices to guess a linear size instance $J$ with $I \subseteq J$
and test in polynomial time that $J$ satisfies $\neg q(\mathbf{a})$. The
lower bound is established by a reduction from 3-colorability.
Let $q$ be the GNFO sentence (for readability, we omit the
repeated occurrences of $Nx$ as guard):

$\exists x\,(Nx \wedge \neg P_1x \wedge \neg P_2x \wedge \neg P_3x) \;\vee\; \bigvee_{i} \exists xy\,(Exy \wedge P_ix \wedge P_iy)$    (3)

expressing that $P_1, P_2, P_3$ do not constitute a valid
3-coloring of the graph $(N, E)$. It is easy to check that a
simple undirected graph $G$ is not 3-colorable iff $G \models_{\mathrm{OWA}} q$, and
it is straightforward to formulate the domain-independent
boolean query (3) in GN-SQL. □
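The correspondence between sentence (3) and 3-colorability can be checked by brute force on small graphs. The sketch below is an illustration under a simplifying assumption that is ours, not the paper's: it only enumerates extensions of the graph that add $P_i$ facts on existing nodes. This suffices here because adding further $N$ or $E$ facts or new elements can only make (3) easier to satisfy. The function names are hypothetical, and the search is exponential in the number of nodes.

```python
from itertools import combinations, product


def q_holds(nodes, edges, colouring):
    """Evaluate sentence (3): some node is uncoloured, or some edge is monochromatic."""
    uncoloured = any(not colouring[x] for x in nodes)
    clash = any(colouring[x] & colouring[y] for x, y in edges)
    return uncoloured or clash


def owa_entails_q(nodes, edges):
    """Decide G |=_OWA q by searching for an extension of G that falsifies q."""
    # Each node may receive any subset of the three colour predicates P1, P2, P3.
    subsets = [frozenset(c) for r in range(4) for c in combinations((1, 2, 3), r)]
    for choice in product(subsets, repeat=len(nodes)):
        colouring = dict(zip(nodes, choice))
        if not q_holds(nodes, edges, colouring):
            return False                      # counterexample: G is 3-colorable
    return True


TRIANGLE = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
K4 = (["a", "b", "c", "d"],
      [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")])
```

The triangle is 3-colorable, so some extension falsifies (3) and the query is not entailed; the complete graph on four nodes is not 3-colorable, so every extension satisfies (3).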

This is remarkable, given that open-world query answering
is in general undecidable for first-order queries, even in
the absence of constraints. 

Recall that every frontier-guarded tgd can be formulated
as a GNFO sentence. This allows us to lift the above result
to the open-world query answering problem with constraints
that are frontier-guarded tgds. More precisely, if $\Sigma$ is a set of
frontier-guarded tgds, then $\mathrm{OWA}_{q,\Sigma}$, by definition, coincides
with $\mathrm{OWA}_{q \vee \bigvee_{\sigma \in \Sigma} \neg\sigma}$, and therefore we get the following.






Corollary 7.5 For each GNFO query $q$ and for each finite
set of frontier-guarded tgds $\Sigma$, $\mathrm{OWA}_{q,\Sigma}$ is in coNP.

In various contexts, such as data exchange [19], it is useful
to consider incomplete databases that contain, besides
constant values, also labeled null values. In this case, open 
world query answering is defined not in terms of extensions 
of instances, but in terms of homomorphisms that are al- 
lowed to map the labeled null values to constant values or 
to other labeled null values. It is worth observing that the 
above proofs go through in this more general setting with 
null values, showing that for GNFO queries q and for finite 
sets of frontier-guarded tgds $\Sigma$, $\mathrm{OWA}_{q,\Sigma}$ is in coNP even
over instances containing labeled nulls. 

Next we identify a subfragment of GNFO that accommo- 
dates the earlier mentioned formalisms including conjunctive 
queries and frontier-guarded tgds and whose queries enjoy 
PTime data complexity for OWA. Recall that open-world
query answering $I \models_{\mathrm{OWA}} q$ asks for the unsatisfiability of
$I \cup \{\neg q\}$. Under negation, the subformula
$\exists x\,(Nx \wedge \neg P_1x \wedge \neg P_2x \wedge \neg P_3x)$ of the coNP-complete query (3) turns into the
disjunctive requirement $\forall x\,(Nx \rightarrow P_1x \vee P_2x \vee P_3x)$ that is,
in an intuitive sense, ultimately responsible for intractability.
Indeed, it has been observed in the context of DL-Lite
that the introduction of even the weakest form of disjunction
renders query answering intractable (see, e.g., [13]).
It turns out that positive occurrences of conjunctions
$\neg A(\mathbf{x}) \wedge \ldots \wedge \neg B(\mathbf{x})$ involving two or more negated conjuncts
are the only source of intractability in GNFO queries.



Definition 7.6 (serial GNFO queries, SGNQ)

A GNFO-formula $\phi$ is serial if it is in DNF and no conjunction
$\neg\chi(\mathbf{x}) \wedge \ldots \wedge \neg\psi(\mathbf{x})$ with two or more negated conjuncts
occurs positively in $\phi$, i.e., in the scope of an even number of
negations. Let SGNQ denote the set of serial GNFO queries.

Clearly, every union of conjunctive queries is a serial
GNFO query. Furthermore, every frontier-guarded tgd,
as well as its negation, is equivalent to a boolean serial
GNFO query. In fact, for every finite set $\Sigma$ of
frontier-guarded tgds and for every serial GNFO query $q$, we have
that $q \vee \bigvee_{\sigma \in \Sigma} \neg\sigma$ is a serial GNFO query. In other words,
the reduction from open-world query answering in the presence
of frontier-guarded tgds to open-world query answering
in the absence of tgds, that we gave earlier, holds also in the
case of serial GNFO queries.

Theorem 7.7 For each SGNQ $q$ and for each finite set $\Sigma$
of frontier-guarded tgds, $\mathrm{OWA}_{q,\Sigma}$ is in PTime.

In fact, for every boolean SGNQ $q$ we can effectively
compute a boolean Datalog query $q'$ such that for all instances
$I$, we have $I \models_{\mathrm{OWA}} q \iff I \models q'$.

There is a boolean SGNQ query q for which OWAq is 
PTime-complete. 

The proof is based on a reduction from the open-world 
query answering problem for SGNQs in the presence of 
frontier-guarded tgds to the open-world query answering 
problem for conjunctive queries in the presence of frontier- 
guarded tgds. A PTime solution of the latter problem via 
Datalog rewritings is due to [5].

Finally, we show that OWA answering GNFO queries un- 
der key constraints is undecidable. This holds even for a 
fixed GNFO query and a fixed key constraint of the form 
$\forall xyz\,(F(x, y) \wedge F(x, z) \rightarrow y = z)$ with $F$ a relation symbol.

Theorem 7.8 (i) There is a boolean conjunctive query $q$
and a set $\Sigma$ comprising guarded tgds and a single key
constraint, so that $\mathrm{OWA}_{q,\Sigma}$ is undecidable.
(ii) There is a boolean SGNQ $q$ and a key constraint $\sigma$, so
that $\mathrm{OWA}_{q,\{\sigma\}}$ is undecidable.

While undecidability of the uniform problem (where the 
query is part of the input) follows from various similar
results for weaker formalisms [34], for a fixed query this seems
to be a new result. 
8. BOUNDEDNESS
DEFINABILITY 

In this section, we study the boundedness problem for 
GN-Datalog. Our main result, Corollary 8.9, states that it is
decidable whether a GN-Datalog program is fully bounded, 
i.e., whether, for every instance, the computation of each 
stratum of the GN-Datalog program reaches a fixed point 
in a bounded number of steps. 

The semantics of a Datalog program $\Pi$ can be defined
in terms of a least fixed point for the IDB predicates. For
this we view $\Pi$, or rather each of its instantiations $\Pi_I$ over
a given instance $I$, as a monotone operator. An application
of this operator to any instantiation of the IDB predicates
produces the result of firing all rules once and in parallel,
on these IDB predicates and the static EDB predicates as
given in $I$. This operator $\Pi_I$ is monotone, and the desired
interpretation of the IDB predicates in $\Pi^{\infty}(I)$ is its unique
least fixed point. This view extends to not necessarily finite
instances $I$, where $\Pi_I$, due to its monotonicity, still has a
unique least fixed point, also referred to as $\Pi^{\infty}(I)$. As in the
case of finite instances, this fixed point is obtained as the
limit of the monotone sequence of stages $\Pi_I^{\alpha}$ generated by
iterating $\Pi_I$ as an update operator, starting from the empty
instantiation for all IDB predicates in stage 0, and taking
unions at limit ordinals, until finally (for cardinality reasons)
a stage $\Pi_I^{\alpha}$ is reached that is a fixed point, and indeed the
unique least fixed point ($\Pi_I^{\alpha+1} = \Pi_I^{\alpha}$ implies $\Pi_I^{\alpha} = \Pi^{\infty}(I)$).

All these considerations hold for any notion of program or 
recursion scheme that shares the crucial monotonicity with 
Datalog programs. Monotonicity refers to monotonicity in 
the IDB arguments, and is guaranteed by syntactic positiv- 
ity in the IDB predicates in all cases we consider. We are 
mostly interested in IDB-positive GNFO-programs, which 
we first investigate in isolation, towards understanding their 
stratified, and overall no longer monotone, use in
GN-Datalog (cf. Definition 8.4 below).

The notion of boundedness captures the semantic and pro- 
cedural essence of non-recursive behavior (in contrast with 
syntactic non-recursiveness as defined in Section 4, which
focuses on a trivial reason for boundedness). 

Definition 8.1 A monotone program $\Pi$ is c-bounded
(bounded in the classical sense, or over unrestricted
instances) if there exists some $n \in \mathbb{N}$ such that $\Pi_I^{n+1} = \Pi_I^{n}$
for every finite or infinite instance $I$. It is bounded over a
class of instances $\mathcal{I}$ if there is such an $n$ that is good for all
$I \in \mathcal{I}$. We call $\Pi$ f-bounded if it is bounded over the class of
all finite instances.

$\mathrm{BDD}(\mathcal{P}, \mathcal{I})$ stands for the boundedness problem for programs
from $\mathcal{P}$ over instances from $\mathcal{I}$: given $\Pi \in \mathcal{P}$, decide whether
$\Pi$ is bounded over $\mathcal{I}$. We reserve the names $\mathrm{BDD}_c(\mathcal{P})$ and
$\mathrm{BDD}_f(\mathcal{P})$ for $\mathrm{BDD}(\mathcal{P}, \mathrm{All})$ and $\mathrm{BDD}(\mathcal{P}, \mathrm{Fin})$, where All
and Fin are the classes of unrestricted and of finite instances,
respectively.
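The difference between a bounded and an unbounded program can be observed experimentally on sample instances, although sampling can of course never decide boundedness. The sketch below is purely illustrative: programs are modelled directly as monotone update operators on the IDB relation, starting from the empty instantiation at stage 0, and the names `closure_stage`, `tc_op`, `bounded_op`, and `path` are assumptions of the example.

```python
def closure_stage(op, instance):
    """Least n such that stage n+1 equals stage n, iterating from the empty IDB."""
    x, n = frozenset(), 0
    while (nxt := frozenset(op(instance, x))) != x:
        x, n = nxt, n + 1
    return n


def tc_op(edges, x):
    """T(X,Y) <- E(X,Y);  T(X,Z) <- E(X,Y), T(Y,Z)."""
    return set(edges) | {(a, c) for (a, b) in edges for (b2, c) in x if b == b2}


def bounded_op(a_facts, x):
    """P(X) <- A(X);  P(X) <- A(X), P(Y): recursive, yet stage 1 is the fixed point."""
    recursive = set(a_facts) if x else set()
    return set(a_facts) | recursive


def path(m):
    """Directed path with m edges: 0 -> 1 -> ... -> m."""
    return {(i, i + 1) for i in range(m)}
```

On paths of growing length the transitive closure operator needs more and more stages, so no single $n$ works for all finite instances and the program is not f-bounded. The second program is syntactically recursive but reaches its fixed point after one stage on every instance.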

Despite its basic nature, the boundedness problem is 
known to be undecidable for even very rudimentary classes 
of programs - a fact which frustrated all hopes to systemat- 
ically eliminate bounded, i.e. spurious, recursion in effective 
tools for query optimization. See for instance [24] for the
undecidability of (f-)boundedness for Datalog programs with
binary IDB predicates, as well as for Datalog programs with 
just monadic IDB predicates but with EDB negation or even 
just with inequalities in the bodies. One of the few major 
decidability results is the following from [17].

Theorem 8.2 (Cosmadakis, Gaifman, Kanellakis, Vardi)

$\mathrm{BDD}_f(\mathcal{P}) = \mathrm{BDD}_c(\mathcal{P})$ is decidable for the class $\mathcal{P}$ of all
monadic Datalog programs.

The following result from classical model theory is of
fundamental importance for links between boundedness and
first-order (FO) definability. It speaks about IDB-positive
programs $\Pi$ that are first-order in the sense that the bodies
of rules can be expressed in FO, by formulas that are positive
in all IDB predicates (which guarantees monotonicity).
We use the term first-order programs in this sense. We say
that the fixed point of $\Pi$ is FO-definable over the class $\mathcal{I}$
if each IDB predicate in the least fixed point $\Pi^{\infty}(I)$ is
definable in terms of the EDB predicates by some first-order
formula, uniformly across all $I \in \mathcal{I}$.

Theorem 8.3 (Barwise-Moschovakis [8]) An IDB-positive
first-order program $\Pi$ is bounded in the classical
sense if, and only if, the fixed point of $\Pi$ is FO-definable
over the class of all (finite and infinite) instances.

Analogous equivalences can be derived for many natural
fragments $L \subseteq \mathrm{FO}$, where boundedness of IDB-positive
$L$-programs is equated with $L$-definability of their fixed points.
This is true in particular also for the guarded negation
fragment $\mathrm{GNFO} \subseteq \mathrm{FO}$.

Moreover, for many well-behaved fragments $L \subseteq \mathrm{FO}$ there
are model theoretic transfer results that say that an
$L$-program $\Pi$ is bounded over $\mathcal{I}$ if, and only if, it is bounded
over some subclass $\mathcal{I}_0 \subseteq \mathcal{I}$. A case of particular interest
is a finite model property for boundedness, which links the
classical notion to its finite model theory version. This, too,
is available in the case of GNFO.



Definition 8.4 A GNFO-program is an IDB-positive
program $\Pi$ with rules of the form

$X\mathbf{x}_s \leftarrow \alpha_s(\mathbf{x}_s) \wedge \phi_s(\mathbf{X}, \mathbf{x}_s)$

where $\phi_s \in \mathrm{GNFO}$ is positive in the IDB predicates $\mathbf{X}$ and
$\alpha_s$ is an EDB atom guarding the variable tuple $\mathbf{x}_s$ in the
head.

The following says that for GNFO we are in the ideal
situation that f-boundedness and c-boundedness coincide, and that
the classical and finite model theory variants of the
Barwise-Moschovakis correspondence hold. The finite model theory
analogue is the least straightforward of these.



Proposition 8.5 For GNFO-programs $\Pi$ and their least
fixed points $\Pi^{\infty}$, t.f.a.e.:

(i) $\Pi^{\infty}$ is FO-definable over all finite instances,
(ii) $\Pi^{\infty}$ is FO-definable over all unrestricted instances,
(iii) $\Pi^{\infty}$ is GNFO-definable over all finite instances,
(iv) $\Pi^{\infty}$ is GNFO-definable over all unrestricted instances,
(v) $\Pi$ is bounded over all finite instances,
(vi) $\Pi$ is bounded over all unrestricted instances.

Another crucial transfer property for BDD(GNFO) is 
based on the notion of treewidth. In [7], it was suggested
that the key to the good computational behavior of GNFO 
and GNFP lies in the fact that these logics have a tree- 
like model property: for testing the satisfiability and the 
entailment of formulas, it suffices to consider structures of 
bounded treewidth. The same notion provides the key to 
decidability of boundedness as well. 

The width $w(\Pi)$ of a GNFO-program $\Pi$ is the maximum
number of element variables used in any of its rules in DNF.



It is known, for instance, that the universal fragment of 
FO, despite its finite model property, does not satisfy this 
analogue: there is a purely universal program whose limit 
is uniformly definable in universal FO across all finite in- 
stances, although it is unbounded over finite instances. 



Lemma 8.6 A GNFO-program $\Pi$ of width $\leq w$ is bounded
over all unrestricted instances if, and only if, it is bounded
over the class of all (possibly infinite) instances of treewidth
at most $w$.

Proof. Each finite stage $X^{n}$ of $\Pi$ can be defined by a
sequence of GNFO-formulas whose width is bounded by $w$.
In particular, for each natural number $n$, boundedness of
$\Pi$ at stage $n \geq 1$, w.r.t. a class of structures, is equivalent
to the validity of a certain GNFO-sentence of width w, on 
that class of structures. Since a GNFO-sentence of width 
w is valid on arbitrary structures if and only if it is valid 
on structures of treewidth at most w (cf. [tIISS]), the claim 
follows. □

We turn to decidability of BDDc(GNFO) and of full 
boundedness (to be defined below) of GN-Datalog. Given 
the meager history of decidability results concerning bound- 
edness for database purposes, it is interesting that here is 
one considerable extension of the early decidability result 
for monadic Datalog from [17], cf. Theorem 8.2 above. 

We note that GN-Datalog is strictly more expressive than
monadic Datalog, but avoids the dangers of negation that 
render boundedness undecidable, for instance, in the exten- 
sion of monadic Datalog by just inequalities, or by negative 
as well as positive access to some binary EDB predicates. 

Technically, the following decidability assertion is an easy 
corollary to the decidability results for monadic second-order 
logic and guarded second-order logic over tree-like structures 
in [10]. These results in turn are based on a non-trivial
reduction to an automata theoretic decidability result of
Colcombet and Löding, which, in the relevant strength needed
here, has not been published yet. As in [10], we indicate this
caveat formally as an assumption (ILT), which refers to the 
decidability of limitedness for weighted parity automata on 
infinite trees, as announced in connection with progress on 
earlier work in [16] . 

Recall that $\mathrm{BDD}_c(\mathrm{GNFO})$ and $\mathrm{BDD}_f(\mathrm{GNFO})$ coincide.

Theorem 8.7 (assuming ILT) Boundedness for GNFO- 
programs is decidable. 

Proof. The GNFO-formulas in a GNFO-program can
be translated into explicitly guarded formulas of guarded
second-order logic GSO (denoted GSO* in [10]). The result
then follows from the decidability of $\mathrm{BDD}(\mathrm{GSO}^*, \mathcal{W}_k)$,
boundedness for GSO* over the class of structures of
treewidth $k$, where both the GSO*-formulas and the parameter
$k$ are treated as input (Theorem 8.8 in [10]). We apply
this to the GSO*-translation of the input GNFO-program $\Pi$
over the class $\mathcal{W}_k$ for $k := w(\Pi)$. By Lemma 8.6, this
is a valid reduction. □

NB: since $\Pi$ really corresponds to a system of least fixed
points in several IDB predicates $\mathbf{X}$, we need the result for
systems of simultaneous fixed points in GSO* from [10], as
discussed in the proof sketch for Theorem 11.5 there.

Towards our interest in GN-Datalog, with its stratified
use of guarded negation as defined in Section 4, we extend
the notion of boundedness from Definition 8.1 as follows.

Definition 8.8 A GN-Datalog program $\Pi = (\Pi_i)_{i \leq n}$ is
called fully f/c-bounded over a class of instances $\mathcal{I}$ if each
stratum $\Pi_i$ is f/c-bounded over the class of all instances
obtained from instances in $\mathcal{I}$ by evaluating all IDB predicates
from lower strata according to $\Pi_{<i}$ and treating
them as EDB for $\Pi_i$. Equivalently, a GN-Datalog program
$\Pi = (\Pi_i)_{i \leq t}$ is fully f/c-bounded if there are natural numbers
$k_1, \ldots, k_t$ such that for all finite/unrestricted instances $I$,

$\Pi^{\infty}(I) = \Pi_t^{k_t}(\Pi_{t-1}^{k_{t-1}}(\cdots \Pi_1^{k_1}(I) \cdots))$.

Corollary 8.9 (assuming ILT) For GN-Datalog, full
f-boundedness is decidable and coincides with full
c-boundedness.

Proof. The proof is by induction on the number of
strata. Note that by definition of full boundedness, a stratified
GN-Datalog program $\Pi = (\Pi_i)_{i \leq n}$ fails to be fully
bounded if, and only if, there is a least stratum $m \leq n$ such
that $\Pi_m$ is unbounded over the class of instances obtained
by evaluating all IDB predicates of lower strata according to
$\Pi_{<m}$. Since these lower strata are bounded, this partial
evaluation is in fact GNFO-definable. It follows that the above
arguments concerning the GNFO-variant of the
Barwise-Moschovakis theorem and its finite model theory version
carry through, stratum by stratum, and up to the first
stratum that turns out to be unbounded, if any. This also
reduces the decidability claim to that in Theorem 8.7. □

We remark that the passage through boundedness for
GSO* over $\mathcal{W}_k$, which is known to be of non-elementary
complexity even for $k = 1$, prevents us from extracting any
reasonable complexity bounds. It is conceivable, of course,
that alternative methods yield such bounds (as is the case
for other special cases of interest, besides that of monadic
Datalog, that also follow from the master result of [10]).

9. DISCUSSION 

9.1 Further extensions of GN-SQL 

Inequalities. GN-SQL can be viewed as a well-behaved
query language extending unions of conjunctive queries with
a restricted form of negation. In this sense, it is natural
to compare GN-SQL to UCQ($\neq$), the language of unions
of conjunctive queries with inequalities. Like GN-SQL,
UCQ($\neq$) is computationally well-behaved: query containment
is $\Pi^p_2$-complete [27, 38], the combined complexity of
query evaluation is NP-complete, and the data complexity
of open world query answering is NP-complete w.r.t. a large
class of constraints [19], cf. also [30]. In this light, and in the
light of Figure 3, the question arises whether we can extend
GN-SQL to allow for the use of (unguarded) inequalities.

Let us denote by GN-SQL($\neq$) the extension of GN-SQL
where conditions may make use of the inequality relation
($\neq$), but the inequality relation cannot be used to guard
negations. It is easy to see that Theorem 7.1 extends to
GN-SQL($\neq$): we may view the inequality as just another
relation that is part of the input instance. All the
other results we obtained for GN-SQL, however, fail for
GN-SQL($\neq$). This follows from the fact that it is possible to
express functional dependencies in GN-SQL($\neq$). Indeed, every
functional dependency

$\forall x y z u v\, (F(x, y, u) \land F(x, z, v) \to u = v)$

is equivalent to the GNFO sentence with inequality

$\neg\exists x y z u v\, (F(x, y, u) \land F(x, z, v) \land u \neq v)$,

which can easily be expressed in GN-SQL($\neq$) as well. Recall
that inclusion dependencies too can be expressed in GN-SQL.
This, together with classical results in dependency theory
(cf. [1] and Theorem 7.8(ii)), implies the following:



Theorem 9.1 (i) GN-SQL($\neq$) is not finitely controllable
for satisfiability or query containment.
(ii) The satisfiability and query containment problems for
GN-SQL($\neq$) are undecidable (both on finite instances
and on unrestricted instances).
(iii) There is a GN-SQL($\neq$) query for which open world
query answering is undecidable.
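
The encoding of a functional dependency as a negated existential sentence with inequality can be illustrated on a concrete relation. The following is a minimal sketch, assuming a ternary relation F stored as a Python set of triples; the function names are hypothetical and chosen only for this illustration.

```python
from itertools import product

def fd_violations(F):
    """Pairs of facts witnessing: exists x,y,z,u,v. F(x,y,u) and F(x,z,v) and u != v."""
    return [(s, t) for s, t in product(F, repeat=2)
            if s[0] == t[0] and s[2] != t[2]]

def satisfies_fd(F):
    # The dependency holds exactly when the existential sentence is false,
    # i.e. when no violating pair of facts exists.
    return not fd_violations(F)

good = {("a", 1, "p"), ("a", 2, "p"), ("b", 1, "q")}
bad = good | {("a", 3, "r")}   # same first column "a", different third column
```

Here `satisfies_fd(good)` holds, while `bad` contains two facts that agree on the first column and differ on the third.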

Known results for various description logics contained in
GNFO imply that OWA answering for GNFO($\neq$) queries is
undecidable when the query is part of the input [34]. Theorem 9.1
strengthens this by showing that the problem is
undecidable already for a fixed GNFO($\neq$) query. Naturally,
similar results can be obtained for the extension of
GN-Datalog with inequalities.

Constants and Comparisons. GN-SQL queries, as we defined
them, cannot contain constant values, nor arithmetical
comparisons (i.e., conditions of the form $t_1 < t_2$). Indeed,
over linearly ordered domains, inequalities can be expressed
using arithmetical comparisons ($x \neq y$ is equivalent
to $x < y \lor y < x$), and hence, by Theorem 9.1, most
problems immediately become undecidable when arithmetical
comparisons are allowed. However, as we will show, our
results do generalize to the extension of GN-SQL where (i)
queries may contain constant values, and (ii) arithmetical
comparisons of the form $t_1 < t_2$ are allowed provided that
at least one of $t_1, t_2$ is a constant value.

In what follows, let LIN $= (D, \prec)$ be any ordered domain
(where $D$ is a countable set and $\prec$ is a total order on $D$)
that is "reasonable" in the sense that the following problems
are all solvable in polynomial time (for some appropriate
representation of the elements of $D$):

1. given $d_1, d_2 \in D$, is it the case that $d_1 \prec d_2$?

2. given $d \in D$, does there exist $d' \in D$ with $d' \prec d$?

3. given $d \in D$, does there exist $d' \in D$ with $d \prec d'$?

4. given $d_1, d_2 \in D$, does there exist $d' \in D$ with $d_1 \prec d' \prec d_2$?

Essentially, all the usual ordered domains, such as the
natural numbers $(\mathbb{N}, <)$, the rational numbers $(\mathbb{Q}, <)$, and
the strings $(A^*, <_{lex})$ over a finite ordered alphabet $A$, are
reasonable in this sense.
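
For two of the domains just mentioned, the four problems have immediate constant-time answers. The sketch below is only an illustration: the class and method names are hypothetical, and elements are assumed to be Python integers (for the naturals) or `Fraction` values (for the rationals).

```python
from fractions import Fraction

class Naturals:
    """(N, <) with the four tests from the list above."""
    def less(self, d1, d2):           # problem 1: d1 < d2 ?
        return d1 < d2
    def has_smaller(self, d):         # problem 2: only 0 has nothing below it
        return d > 0
    def has_larger(self, d):          # problem 3: N has no largest element
        return True
    def has_between(self, d1, d2):    # problem 4: need a gap of at least 2
        return d2 - d1 >= 2

class Rationals:
    """(Q, <): a dense order without endpoints."""
    def less(self, d1, d2):
        return Fraction(d1) < Fraction(d2)
    def has_smaller(self, d):
        return True
    def has_larger(self, d):
        return True
    def has_between(self, d1, d2):    # density: any d1 < d2 has a midpoint
        return Fraction(d1) < Fraction(d2)
```

The two domains differ only in problems 2 and 4, which is where discreteness and the least element of the naturals matter.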

Let GN-SQL(LIN) be the extension of the GN-SQL syntax
where (i) all terms $t$ are allowed to be either of the form
R.ATTR (as before) or to be an element of LIN (in which case
we call $t$ a constant); and (ii) for all terms $t_1, t_2$ of which at
least one is a constant value, $t_1 < t_2$ is allowed as an atomic
condition. The semantics of GN-SQL(LIN) queries is only
well-defined for instances whose active domain is a subset of
LIN. Therefore, we restrict attention to such instances.

All results for GN-SQL that we have presented can be
extended to GN-SQL(LIN). For simplicity, we sketch the relevant
construction here only for the query containment problem.

Theorem 9.2 Let LIN be any reasonable ordered domain.
GN-SQL(LIN) query containment is 2ExpTime-complete.



Aggregation. Recall that GN-SQL does not allow for any
form of aggregation that is available in SQL. This is for good
reason: allowing even simple forms of aggregation such as
counting would quickly lead to undecidability, since query
containment for unions of conjunctive queries under the bag
semantics is undecidable [25].

9.2 Further Extensions of GN-Datalog 

Allowing IDBs As Guards. If in the definition of negation- 
guarded Datalog rules one permits also the use of IDB atoms 
from the same or lower stratum as guards, this can result 
in an exponential gain in succinctness but does not increase 
the expressive power. (A simple induction on strata and 
on stages of inductive definitions of IDB predicates confirms 
that all tuples added to the interpretation of IDB predi- 
cates are guarded by some EDB atom.) Query evaluation 
complexity, however, suffers an exponential blow-up as a 
consequence of this relaxation. 

Proposition 9.3 (GN-Datalog with IDB guards)
Answering GN-Datalog queries with IDB atoms allowed as
guards is ExpTime-complete in combined complexity.

Capturing the Alternation-Free Fragment of GNFP. 
In [20], an extension of Datalog-LIT was presented, called 
Datalog-LITE, which includes "generalized literals" and was 
shown to capture the alternation-free fragment of guarded 
fixed point logic (GFP). We expect that GN-Datalog can be 
similarly extended, in order to subsume Datalog-LITE and 
capture the alternation-free fragment of GNFP. 
APPENDIX

A. MISSING PROOFS 

A.1 Proof of Theorem 2.2 and inexpressibility claims

Proof. From GN-RA expressions to GNFO formulas,
there is a straightforward inductive linear translation. More
precisely, the following table describes how to translate each
GN-RA expression $E$ of arity $k$ into an equivalent (therefore
domain-independent) GNFO formula $\varphi_E(x_1, \dots, x_k)$.

    $R$                         $R(x_1, \dots, x_k)$
    $\pi_{i_1 \dots i_n}(E)$    $\exists \mathbf{z}\, (\varphi_E(\mathbf{z}) \land \bigwedge_{j=1}^{n} x_j = z_{i_j})$
    $E \times E'$               $\varphi_E(x_1, \dots, x_{k_1}) \land \varphi_{E'}(x_{k_1+1}, \dots, x_{k_1+k_2})$
    $E \cap E'$                 $\varphi_E(x_1, \dots, x_k) \land \varphi_{E'}(x_1, \dots, x_k)$
    $E \cup E'$                 $\varphi_E(x_1, \dots, x_k) \lor \varphi_{E'}(x_1, \dots, x_k)$

For the converse direction, we proceed as follows. We first
construct a GN-RA expression ADOM that defines the active
domain of the instance (the union of all unary projections
of atomic relations). Next, for each sequence of
variables $\mathbf{x} = x_1, \dots, x_k$ and for each atomic GNFO formula
$\varphi$ whose free variables are included in $\mathbf{x}$, we compute a
$k$-ary GN-RA expression $tr_{\mathbf{x}}(\varphi)$ that is equivalent to it under
the active domain semantics (i.e., over structures whose
domain coincides with the active domain). For instance,

$tr_{x_1,x_2,x_3}(R(x_2, x_2)) = \pi_{1,2,4}\,\sigma_{2=3}(\text{ADOM} \times R \times \text{ADOM})$, and

$tr_{x_1,x_2,x_3}(x_1 = x_2) = \pi_{1,1,2}(\text{ADOM} \times \text{ADOM})$.

Finally, the translation $tr_{\mathbf{x}}(\cdot)$ is extended to complex GNFO formulas.
Conjunction, disjunction and existential quantification are
translated as intersection, union, and projection. Hence, the
only remaining case is for $tr_{\mathbf{x}}(\alpha(\mathbf{y}) \land \neg\varphi(\mathbf{y}))$, where the
variables in $\mathbf{y}$ are included in the variables in $\mathbf{x}$. If the guard
$\alpha$ is a relational atom, $tr_{\mathbf{x}}(\alpha(\mathbf{y}) \land \neg\varphi(\mathbf{y}))$ can be defined as
the GN-RA expression obtained from $tr_{\mathbf{y}}(\alpha) - tr_{\mathbf{y}}(\varphi)$ by (i)
pulling out selections and projections as necessary in order
to turn the first argument of the complementation operator
into a projection of an atomic relation; and (ii) taking
a product with ADOM for each variable from $\mathbf{x}$ that is not
included in $\mathbf{y}$.

If the guard $\alpha$ is of the form $y_1 = y_2$, then $tr_{\mathbf{y}}(\alpha(\mathbf{y}) \land
\neg\varphi(\mathbf{y}))$ is defined, in the first instance, as $\pi_{1,1}(\text{ADOM} -
\pi_1 \sigma_{1=2}\, tr_{\mathbf{y}}(\varphi))$. Since ADOM is in general a union of all unary
projections of atomic relations, we need to pull out the union
from the scope of the complementation operator. This is
where an exponential blow-up may be incurred. □
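
The two example translations of atomic formulas can be checked on a small instance. The following is a minimal sketch, assuming relations are Python sets of tuples and column positions are 1-based as in the text; the helper names (`adom`, `cross`, `select_eq`, `project`) are hypothetical.

```python
from itertools import product

def adom(instance):
    """Unary relation holding every value that occurs in some fact."""
    return {(v,) for rel in instance.values() for t in rel for v in t}

def cross(*rels):
    """Cartesian product; tuples are concatenated left to right."""
    return {sum(ts, ()) for ts in product(*rels)}

def select_eq(rel, i, j):
    """Selection: keep tuples whose i-th and j-th columns agree."""
    return {t for t in rel if t[i - 1] == t[j - 1]}

def project(rel, cols):
    """Projection; columns may be repeated, as in pi_{1,1,2}."""
    return {tuple(t[c - 1] for c in cols) for t in rel}

I = {"R": {("a", "a"), ("a", "b")}}
A = adom(I)

# tr_{x1,x2,x3}(R(x2, x2)) = pi_{1,2,4} sigma_{2=3} (ADOM x R x ADOM)
tr_atom = project(select_eq(cross(A, I["R"], A), 2, 3), (1, 2, 4))

# tr_{x1,x2,x3}(x1 = x2) = pi_{1,1,2} (ADOM x ADOM)
tr_eq = project(cross(A, A), (1, 1, 2))
```

On this instance `tr_atom` contains exactly the triples whose middle component is `a`, and `tr_eq` the triples whose first two components coincide, with the remaining positions ranging over the active domain.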

Proposition A.1 The following RA expressions are not
equivalent to GN-RA expressions:

1. $(\pi_1(R) \times S) - \pi_{1,1}(R)$

2. $\pi_{1,4}(\sigma_{2=3}(R \times R)) - R$

3. $\pi_1(R) - \pi_1((\pi_1(R) \times S) - R)$

Proof. In [t], the notion of GN-bisimulation was introduced,
and it was shown that GN-bisimulations preserve
the truth of GNFO sentences. Together with Theorem 2.2,
this allows us to show non-expressibility of the above RA
expressions in GN-RA. Note that if any of the above RA
expressions were definable in GN-RA, then its boolean
projection would also be definable in GN-RA. It can be shown
that

1. the instance $\{R(a,b), S(a), S(c)\}$, which satisfies the
boolean projection of $(\pi_1(R) \times S) - \pi_{1,1}(R)$, is
GN-bisimilar to the instance $\{R(a,b), S(a)\}$, which does
not.

2. the instance $\{R(a,b), R(b,c), R(a,c), R(b,d)\}$,
which satisfies the boolean projection of
$\pi_{1,4}(\sigma_{2=3}(R \times R)) - R$, is GN-bisimilar to the
instance $\{R(a,b), R(a,c), R(b,c)\}$, which does not.

3. the instance $\{R(a,b), S(a), S(c)\}$, which satisfies the
boolean projection of $(\pi_1(R) \times S) - \pi_{1,1}(R)$, is
GN-bisimilar to the instance $\{R(a,b), S(a)\}$, which does
not. □
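
The truth values claimed for the first two expressions on the listed instances can be recomputed directly. The sketch below assumes relations are Python sets of tuples with 1-based column positions; the helper names are hypothetical. It only evaluates the expressions and does not check GN-bisimilarity.

```python
from itertools import product

def cross(r, s):
    return {a + b for a, b in product(r, s)}

def project(r, cols):
    return {tuple(t[c - 1] for c in cols) for t in r}

def select_eq(r, i, j):
    return {t for t in r if t[i - 1] == t[j - 1]}

def expr1(R, S):
    # (pi_1(R) x S) - pi_{1,1}(R)
    return cross(project(R, (1,)), S) - project(R, (1, 1))

def expr2(R):
    # pi_{1,4}(sigma_{2=3}(R x R)) - R
    return project(select_eq(cross(R, R), 2, 3), (1, 4)) - R

# Instances from items 1 and 2; a boolean projection holds iff the result is non-empty.
r1 = expr1({("a", "b")}, {("a",), ("c",)})
r1_small = expr1({("a", "b")}, {("a",)})
r2 = expr2({("a", "b"), ("b", "c"), ("a", "c"), ("b", "d")})
r2_small = expr2({("a", "b"), ("a", "c"), ("b", "c")})
```

In both cases the larger instance yields a non-empty result and the smaller one an empty result, matching the satisfaction claims in items 1 and 2.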

A.2 Proof of Theorem O 

Proof sketch. Let $q$ be any (closed) GN-SQL query.
We may assume without loss of generality that each tuple
variable $R$ occurring in $q$ is declared in exactly one
from-clause, and therefore has a unique associated relation name,
that we will denote by REL$_R$. By a simultaneous induction,
we can

• translate each (open or closed) GN-SQL query $q$ to a
GNFO formula $\varphi_q(\mathbf{x})$, where $\mathbf{x}$ is a sequence of first-order
variables, one for each attribute name belonging
to the type of the query $q$ and one for each term R.ATTR
where $R$ is a tuple variable that occurs freely in $q$ and
ATTR is an attribute name belonging to the type of
REL$_R$,

• translate each GN-SQL condition $c$ to a GNFO formula
$\varphi_c(\mathbf{x})$, where $\mathbf{x}$ is a sequence of first-order variables,
one for each term R.ATTR where $R$ is a tuple variable
that occurs freely in $c$ and ATTR is an attribute name
belonging to the type of REL$_R$.

We omit the detailed definition of the translation, which is
straightforward. The clause for not is as follows:

$\varphi_{\text{not}(condition)}(\mathbf{x}) = \text{REL}_R(\mathbf{x}) \land \neg\varphi_{condition}(\mathbf{x})$

It is not hard to see that each closed GN-SQL query $q$ is
equivalent to its GNFO translation $\varphi_q$. In particular, this
implies that $\varphi_q$ is domain independent.



For the converse translation, from domain-independent 
GNFO formulas to GN-SQL queries, it is convenient to first 
assume that we have at our disposal a relation ADOM with 
a single attribute A containing all elements belonging to 
the active domain. As we will show, using such a relation, 
it is quite straightforward to give an inductive polynomial 
translation from GNFO to GN-SQL. On the other hand, 
all usage of ADOM can be eliminated at the cost of an ex- 
ponential blow-up. To see this, recall that relation names 
can only appear in FO-SQL queries in the from-clause of a 
select-from-where expression. Thus, any occurrence of ADOM 
must be of the form 

select $\alpha$ from (. . . , ADOM R, . . . ) where $\beta$

where, in addition, the expressions $\alpha$ and $\beta$ may refer to
R.A. We may equivalently replace such an expression by
the union of all expressions of the following form (for all
relation names REL$_i$ and attribute names ATTR$_j$):

select $\alpha'$ from (. . . , REL$_i$ R, . . . ) where $\beta'$

where $\alpha'$ and $\beta'$ are obtained from $\alpha$ and $\beta$ by replacing
all occurrences of R.A by R.ATTR$_j$. Clearly, applying this
transformation for all occurrences of ADOM yields an equivalent
query that does not make use of ADOM and that is at
most singly exponentially larger than the original query.
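
The source of the exponential bound can be made concrete by counting how many ADOM-free variants arise when each occurrence of ADOM is replaced independently. This is a rough sketch under stated assumptions: the schema is a hypothetical example, and the function only enumerates the (relation, attribute) choices rather than rewriting an actual query.

```python
from itertools import product

def adom_free_variants(k, schema):
    """All ways to replace k occurrences of ADOM by a (relation, attribute) pair."""
    choices = [(rel, attr) for rel, attrs in schema.items() for attr in attrs]
    return list(product(choices, repeat=k))

# Hypothetical schema with three attribute positions in total.
schema = {"Emp": ["name", "dept"], "Dept": ["id"]}

one = adom_free_variants(1, schema)   # one union member per attribute position
two = adom_free_variants(2, schema)   # choices multiply across occurrences
```

With m attribute positions in the schema and k occurrences of ADOM there are m^k combinations, which is where the single exponential comes from.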

Next, we explain how to translate domain-independent
GNFO formulas to GN-SQL queries with the help of the
ADOM relation. Let $\varphi(\mathbf{x})$ be any domain-independent GNFO
formula. We assume w.l.o.g. that $\varphi$ does not reuse any variables,
and associate to each first-order variable $z$ a corresponding
distinct tuple variable $R_z$ (whose type, in the
expressions below, will consist of a single attribute named A).
Next, we inductively translate each GNFO formula $\varphi$ to a
GN-SQL condition $\varphi^*$, as follows.

    $(x = y)^*$ = ($R_x$.A = $R_y$.A)

    $(\text{REL}(x_1, \dots, x_n))^*$ = exists(select R.A$_1$ from REL R where
        R.A$_1$ = $R_{x_1}$.A and . . . and R.A$_n$ = $R_{x_n}$.A)

    $(\varphi \land \psi)^*$ = $\varphi^* \land \psi^*$

    $(\varphi \lor \psi)^*$ = $\varphi^* \lor \psi^*$

    $(\exists x\, \varphi)^*$ = exists(select $R_x$.A from ADOM $R_x$ where $\varphi^*$)

    $(\text{REL}(\mathbf{x}) \land \neg\varphi)^*$ = exists(select R.A$_1$ from REL R where
        R.A$_1$ = $R_{x_1}$.A and . . . and R.A$_n$ = $R_{x_n}$.A and not($\hat{\varphi}^*$))

    $(x = y \land \neg\varphi)^*$ = ($R_x$.A = $R_y$.A) and not $\varphi[x/y]^*$

where, in the second clause and in the 6th clause, the schema
of the relation REL is $\{A_1, \dots, A_n\}$, and where, in the 6th
clause, $\hat{\varphi}^*$ is obtained from $\varphi^*$ by replacing each term $R_{x_i}$.A
by R.A$_i$. In the last clause, $\varphi[x/y]$ is the formula obtained
from $\varphi$ by replacing each free occurrence of $x$ by $y$, so that
the formula in question has only one free first-order variable.
Finally, starting with a GNFO formula $\varphi(x_1, \dots, x_n)$ we
define the query $q_\varphi$ as follows:

    select $R_1$.A as ATTR$_1$, . . . , $R_n$.A as ATTR$_n$
    from ADOM $R_1$, . . . , ADOM $R_n$ where $\varphi^*$

where ATTR$_1$, . . . , ATTR$_n$ are distinct attribute names. It is
easy to show that each domain-independent GNFO formula
is equivalent to the GN-SQL query obtained from it in the
above way. □
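
The inductive translation of formulas into conditions can be sketched as a small recursive function. This is an illustration under several assumptions, not the paper's construction verbatim: formulas are nested Python tuples in an ad hoc encoding, fresh tuple variables are named T1, T2, ..., the connectives are printed as `and`/`or`, the equality-guarded clause is omitted, and the replacement of $R_{x_i}$.A by R.A$_i$ in the guarded-negation clause is realized through an environment that rebinds the guarded variables.

```python
from itertools import count

def translate(formula):
    """Translate a formula (nested tuples) into a GN-SQL-style condition string."""
    fresh = count(1)                      # supplies distinct tuple variables T1, T2, ...

    def term(env, x):
        # Term for first-order variable x: R_x.A unless a guard has rebound it.
        return env.get(x, f"R_{x}.A")

    def guard(name, attrs, xs, env, negated=None):
        t = f"T{next(fresh)}"
        conds = [f"{t}.{a} = {term(env, x)}" for a, x in zip(attrs, xs)]
        if negated is not None:
            # Inside not(...), guarded variables are read off the guard tuple t.
            inner = {**env, **{x: f"{t}.{a}" for a, x in zip(attrs, xs)}}
            conds.append(f"not({tr(negated, inner)})")
        return (f"exists(select {t}.{attrs[0]} from {name} {t} "
                f"where {' and '.join(conds)})")

    def tr(f, env):
        op = f[0]
        if op == "eq":                    # ("eq", x, y)
            return f"({term(env, f[1])} = {term(env, f[2])})"
        if op == "rel":                   # ("rel", name, attrs, vars)
            return guard(f[1], f[2], f[3], env)
        if op in ("and", "or"):           # ("and", f1, f2) / ("or", f1, f2)
            return f"({tr(f[1], env)} {op} {tr(f[2], env)})"
        if op == "exists":                # ("exists", x, body)
            x = f[1]
            return f"exists(select R_{x}.A from ADOM R_{x} where {tr(f[2], env)})"
        if op == "gneg":                  # ("gneg", name, attrs, vars, body)
            return guard(f[1], f[2], f[3], env, f[4])
        raise ValueError(f"unknown connective: {op}")

    return tr(formula, {})

# R(x, y) and not S(x), with the negation guarded by the atom R(x, y).
example = translate(("gneg", "R", ["A1", "A2"], ["x", "y"],
                     ("rel", "S", ["B1"], ["x"])))
```

In the output for `example`, every term inside `not(...)` refers to the guard tuple `T1` rather than to `R_x` or `R_y`, which is the effect of the replacement step in the 6th clause.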

A.3 Proof of Theorem U 

Proof. Consider any non-recursive GN-Datalog query
$q = (\Pi, Ans)$ with $\Pi = (\Pi_1, \dots, \Pi_n)$. A straightforward
induction on $k$ shows that, for every $k \le n$ and for every
$X \in IDB^{\Pi_k}$, there is a GNFO formula $\varphi$ that defines the
relation computed by $X$. The GNFO formula in question
can be obtained by taking the disjunction of all bodies of
rules that have $X$ in the head, replacing all occurrences of
IDBs $Y \in IDB^{\Pi_\ell}$ with $\ell < k$ by their (previously obtained)
defining GNFO formulas. Finally, by taking the query $Ans$
and replacing each IDB $X \in IDB^{\Pi}$ by its defining GNFO
formula, we obtain a GNFO formula that is equivalent to $q$,
and, in particular, domain independent, since $q$ is domain
independent.

Conversely, let $\varphi(\mathbf{x})$ be a domain-independent GNFO formula.
We may assume that $\varphi(\mathbf{x})$ is in DNF. This may require
an exponential blow-up. First, we construct a GN-Datalog
program with a unary IDB ADOM that computes
the active domain, as well as a binary IDB $X_=$ that computes
the relation $\{(x, x) \mid x \in \text{ADOM}\}$, which will be used for
translating equality statements. We omit the construction,
which is straightforward. Next, by induction, we construct
for each subformula of $\varphi$ that is of the form $\alpha \land \neg\chi$ a
non-recursive GN-Datalog program with IDB relations $X_{\alpha \land \chi}$ and
$X_{\alpha \land \neg\chi}$ computing the relation defined by $\alpha \land \chi$ and $\alpha \land \neg\chi$
under the active-domain semantics (i.e., on structures whose
active domain is the entire domain). In particular, if $\alpha$ is a
relational atom, then the program includes the rule

$X_{\alpha \land \neg\chi}(\mathbf{x}) \leftarrow \alpha(\mathbf{x}) \land \neg X_{\alpha \land \chi}(\mathbf{x})$

If $\alpha$ is an equality statement, we proceed similarly, using the
$X_=$ IDB relation introduced above as a guard.

Finally, if $\varphi(\mathbf{x})$ is of the form $\varphi_1(\mathbf{x}_1) \lor \dots \lor \varphi_n(\mathbf{x}_n)$,
we define $Ans$ to be the union of the conjunctive queries
$Ans_i(\mathbf{x}) := (X_{\varphi_i}(\mathbf{x}_i) \land \bigwedge_{x \in \mathbf{x}} \text{ADOM}(x))$. From the
domain-independence of $\varphi(\mathbf{x})$, we obtain that $q = (\Pi, Ans)$ is
equivalent to $\varphi$. □

A.4 Proof of Theorem |0] 

It was shown in [14] that satisfiability for UNFO is 
2ExpTime-hard, where UNFO is a syntactic fragment of 
GNFO. As we explain below, the construction can be 
adapted to prove the same lower bound result already for 
GNFO formulas in DNF. In addition, we can easily en- 
sure that the GNFO formulas in question are domain in- 
dependent, and force the existence of a fixed unary predi- 
cate denoting the active domain. Under these restrictions, 
the translation from GNFO to GN-SQL (Theorem [O) and
the translation from GNFO to non-recursive GN-Datalog
(Theorem 4.4) both run in polynomial time. Hence, we obtain
2ExpTime-hardness for satisfiability, and therefore also
for query containment, for GN-SQL and non-recursive
GN-Datalog.

Proposition A.2 There is a fixed schema such that the
satisfiability problem for GNFO formulas in DNF is 2ExpTime-hard,
both on arbitrary structures and on finite structures.



Proof sketch. The same result, without the DNF requirement,
was shown in [14] in the context of a fragment
of GNFO called UNFO. We briefly sketch the construction
used in the proof in [14], and explain how it can be adapted
to use only GNFO formulas in DNF.

Fix an alternating $2^n$-space bounded Turing machine $M$
whose word problem is 2ExpTime-hard. Let $w$ be a word in
the input alphabet of $M$. We construct a formula $\varphi_w$ that
is satisfiable if and only if $M$ accepts $w$. Moreover, if $\varphi_w$
is satisfiable, then in fact it is satisfied in some finite tree
structure. In this way, we show that the lower bound holds
not only for arbitrary structures, but also for finite trees
and for any class in-between. The formula $\varphi_w$ describes an
(alternating) run of $M$ starting in the initial state with $w$
on the tape, and ending in a final configuration.

The run is encoded as a big tree whose nodes correspond
to configurations, whose child relation corresponds to
successive configurations, and where each node has in addition
a small subtree of height $n$ attached to it, that is used to
describe the tape content at that configuration. Here is an
illustration of a configuration with two successor configurations
(but we allow more than two successor configurations):

Each small subtree has depth exactly $n$. The internal
nodes of the subtree are labeled with a unary predicate $P$.
Hence a path from its root to one of its leaves corresponds
to a bit string of length $n$ denoting a position of the tape.
The label of the leaf codes the content of the tape at that
position.

The formula first enforces that the small subtrees have
the desired structure and that all positions are realized in
at least one leaf of each small subtree. Since we don't have
inequality, we cannot force that it is realized exactly once,
but we can force in GNFO that all nodes where it is realized
satisfy the same relevant unary predicates $A$:

$\neg\exists x y\, (\text{leaf}(x) \land \text{leaf}(y) \land x \uparrow^n\downarrow^n y \land \bigwedge_i (P_i(x) \leftrightarrow P_i(y)) \land A(x) \land \neg A(y))$

Here, $x \uparrow^n\downarrow^n y$ is short for the GNFO formula describing
the fact that there is a path of the form $\uparrow^n\downarrow^n$ from $x$ to $y$,
leaf$(x)$ is short for $\neg\exists y\, R x y$, and $P_i(x)$ is a shortcut for
$\exists y\, (x \uparrow^{n-i} y \land P(y))$.

The following formula suc$(x, y)$ expresses that $x$ and $y$
denote the same tape position in successive configurations,
and it uses only unary negation:

s.uc{x,y) ~ leaf(x)Aleaf(y)A(a;t"+U"+'y)A/\(P4a;)^P»(j/)) 



Note that the first half of the formula says that $x$ and $y$ are
tape cells of successive configurations. Using this formula,
we can specify all relevant properties of the run (the encoding
will involve formulas of the form $\forall x\, (\text{leaf}(x) \land \varphi(x) \to
\exists y\, (\text{suc}(x, y) \land \psi(y)))$).

This concludes the outline of the construction used in [14]
for showing 2ExpTime-hardness of the satisfiability problem
for (a fragment of) GNFO.

The above proof clearly uses GNFO formulas that are not
in DNF, and the straightforward way to bring the formulas
into DNF, by "pulling out disjunction", would lead to formulas
whose length is exponential in $n$. The problem here
lies in the formula $\bigwedge_i (P_i(x) \leftrightarrow P_i(y))$ expressing that two
leaf nodes, in the same configuration or in successor
configurations, encode the same memory location. This use of
disjunction can be avoided using a construction from [9]. In
particular, we enrich our encoding of Turing machine
configurations as follows: to each node $x$ of the structure, we
attach a small substructure consisting of nodes that we mark
with a fresh unary predicate $Q$ in order to distinguish them
from the nodes that belong to the "main structure" (i.e.,
the structure as it was before adding all these new small
substructures). The exact substructure that we attach to
a node $x$ depends on whether or not the node satisfies $P$.
If a node $x$ satisfies $P$, we create a new node $y$ and add
edges $E(x, y)$ and $R(x, y)$. If, on the other hand, $x$ does
not satisfy $P$, we create new nodes $y$ and $z$ and add edges
$E(x, z)$, $R(x, y)$, $R(y, z)$. Here, $E$ is a new binary predicate.
This modification of the structure has the consequence that
we can avoid the use of disjunction in comparing whether
two leaf nodes encode the same memory location: suppose
that $x$ and $y$ are leaf nodes of the same configuration subtree.
Then $(P_i(x) \leftrightarrow P_i(y))$ can be equivalently expressed
as

3uu'vv'{x t"'' uAE{u, u')Ay t""' vAE{v, v')Au f+^\,'+^ v') 

In a similar way, we can express, without using disjunction,
the fact that two nodes encode the same memory location in
successive configuration subtrees. We omit the details. □

A.5 Proof of Theorem O 

Proof. Upper bound: for each IDB except possibly the
answer IDB, the number of tuples that may end up in the
extension of the IDB is bounded by the number of tuples
belonging to the extension of the EDBs, times the number
of rules of the Datalog program, because each rule is
guarded by an EDB (here, incidentally, what really matters
for the argument is that the body of each rule includes an
EDB atom that contains all variables occurring in the head
of the rule). Hence, the number of times a non-answer rule
is applied is bounded by the number of facts in the input 
database instance times the number of rules of the Datalog 
program. Hence, the entire Datalog computation, except 
for the computation of the answer relation, can be viewed 
as a polynomial computation with an NP-oracle (for evalu- 
ating the bodies of rules). Finally, once all IDBs except the 
Answer IDB have been computed, we simply invoke the NP 
oracle once more to test if the given tuple belongs to the 
extension of the answer IDB. 

For the lower bound we provide a reduction from
the LEX(SAT) problem: given a propositional formula
$\Phi(x_1, \dots, x_n)$, determine if the value of $x_n$ is 1 in the
lexicographically least satisfying assignment, where $x_n$ is the least
significant bit. The
LEX(SAT) problem is known to be $P^{NP}$-complete, even for
3-CNF formulas [40].

We devise a structure $\mathfrak{B}$ with a domain of two elements
($\top$ and $\bot$) endowed with a unary relation $T$ that
is true only of $\top$, a unary relation $F$ that is true only
of $\bot$, a binary relation $N$ that holds precisely of the
complementary pairs $(\bot, \top)$ and $(\top, \bot)$, and a ternary relation
$OR$ that is true of all $\{\bot, \top\}$-triplets but $(\bot, \bot, \bot)$.
This way, the set of satisfying assignments to every 3-clause
$C(x_i, x_j, x_k)$, e.g. $x_i \lor \neg x_j \lor \neg x_k$, is the answer set to a
corresponding conjunctive query $\hat{C}(x_i, x_j, x_k)$ on $\mathfrak{B}$, such as
$\exists y_j y_k\, N(x_j, y_j) \land N(x_k, y_k) \land OR(x_i, y_j, y_k)$ in this example.
More generally, we can translate every 3-CNF formula
$\Phi(x_1, \dots, x_n)$ into a Datalog rule with body

$\hat{\Phi}(x_1, \dots, x_n, y_1, \dots, y_n) = \bigwedge_i N(x_i, y_i) \land \bigwedge_C \hat{C}(\mathbf{x}, \mathbf{y})$

where $C$ ranges over the clauses of $\Phi$.

Given a propositional formula $\Phi(x_1, \dots, x_n)$, the idea is
now to have, for each $i \le n$, a unary IDB predicate $X_i$ that
computes the truth value of the $i$-th bit in the lexicographically
least satisfying assignment to $\Phi(x_1, \dots, x_n)$. These
IDBs belong to different strata of the program as inductively
defined by the following rules.

    $Z_1 \leftarrow F(x_1), \hat{\Phi}(\mathbf{x}, \mathbf{y})$
    $X_1(x_1) \leftarrow F(x_1), Z_1$
    $X_1(x_1) \leftarrow T(x_1), \neg Z_1$

    $Z_i \leftarrow X_1(x_1), \dots, X_{i-1}(x_{i-1}), F(x_i), \hat{\Phi}(\mathbf{x}, \mathbf{y})$
    $X_i(x_i) \leftarrow F(x_i), Z_i$
    $X_i(x_i) \leftarrow T(x_i), \neg Z_i$

    $Ans \leftarrow X_n(x_n), T(x_n), \hat{\Phi}(\mathbf{x}, \mathbf{y})$

It is easy to see that the above non-recursive GN-Datalog
query computes on $\mathfrak{B}$ the solution to the LEX(SAT) problem
instance $\Phi$. □
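
The two ingredients of the reduction, encoding a clause as a conjunctive query over the two-element structure and fixing the bits of the least assignment one stratum at a time, can be sketched as follows. This is a brute-force illustration under stated assumptions: Python's `True`/`False` stand in for the two domain elements, the function names are hypothetical, and the bit-by-bit search is exponential, unlike an oracle computation.

```python
from itertools import product

T, F = True, False
N = {(F, T), (T, F)}                                   # complementary pairs
OR = set(product([T, F], repeat=3)) - {(F, F, F)}      # all triples but (F, F, F)

def clause_query(xi, xj, xk):
    """CQ for the clause xi or not xj or not xk:
    exists yj, yk. N(xj, yj) and N(xk, yk) and OR(xi, yj, yk)."""
    return any((xj, yj) in N and (xk, yk) in N and (xi, yj, yk) in OR
               for yj, yk in product([T, F], repeat=2))

def extendable(phi, prefix, n):
    """Is there a satisfying assignment of phi starting with this prefix?"""
    return any(phi(*prefix, *rest)
               for rest in product([F, T], repeat=n - len(prefix)))

def lex_least(phi, n):
    """Fix one bit per round, preferring False, as the strata do for X_1..X_n."""
    prefix = []
    for _ in range(n):
        if extendable(phi, prefix + [F], n):
            prefix.append(F)
        elif extendable(phi, prefix + [T], n):
            prefix.append(T)
        else:
            return None                                # phi is unsatisfiable
    return tuple(prefix)
```

For the single clause above, `lex_least(clause_query, 3)` is the all-false assignment, so the last bit is 0 and the answer to that LEX(SAT) instance would be negative.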

A.6 Proof of Proposition 7.3

In what follows it will be convenient to work with GNFO
formulas in disjunctive normal form. Two critical dimensions
of a GNFO formula in DNF are its 'width', introduced
above, and its 'negation rank'. The negation rank
nrank$(\varphi)$ of $\varphi$ in DNF is the maximum number of nested
negations in $\varphi$. Naturally, UCQs have negation rank 0. Set
$\text{DNF}^r_w = \{\varphi \mid \varphi \text{ in DNF}, \text{width}(\varphi) \le w, \text{nrank}(\varphi) \le r\}$.

Given a structure $U$ we let atoms$(U)$ denote the set of
tuples $\mathbf{u}$ forming the support of a relational atom in $U$.
For every $\mathbf{u} \in \text{atoms}(U)$ and $\mathbf{u}' \in \text{atoms}(U')$ and for every
$w, r \in \mathbb{N}_{\ge 0}$ let $U, \mathbf{u} \equiv^r_w U', \mathbf{u}'$ denote the fact that
$U \models \psi(\mathbf{u}) \iff U' \models \psi(\mathbf{u}')$ for every $\psi \in \text{DNF}^r_w$.
Assuming an ambient finite relational signature, each $\equiv^r_w$ is an
equivalence of finite index as there are, up to logical equivalence,
only finitely many formulas in $\text{DNF}^r_w$. In particular,
for every $\mathbf{u} \in \text{atoms}(U)$ there is a formula $\chi_{U,\mathbf{u}}(\mathbf{x})$ that
is a boolean combination of $\text{DNF}^r_w$-formulas and is characteristic
of its $\equiv^r_w$-class, i.e. such that $U, \mathbf{u} \equiv^r_w U', \mathbf{u}'$ iff
$U' \models \chi_{U,\mathbf{u}}(\mathbf{u}')$. We call $\chi_{U,\mathbf{u}}(\mathbf{x})$ the $\text{DNF}^r_w$-type of $\mathbf{u}$ in $U$.
Note that $\chi_{U,\mathbf{u}}(\mathbf{x})$ is itself not in $\text{DNF}^r_w$. Also note that
$\text{DNF}^r_0$ is empty and that $\text{DNF}^0_w$ comprises only UCQs.

Proof. For the purposes of this construction we consider
the amendment of the language of $\varphi(\mathbf{x})$ with constants $\mathbf{c}$
corresponding to the free variables $\mathbf{x}$, and regard $\varphi$ as the
sentence obtained from the original query by substitution of
each constant $c_i$ in place of the corresponding free variable
$x_i$. Accordingly, given an instance $I$ with distinguished elements
$\mathbf{a}$, we treat $\mathbf{a}$ as the interpretation of the constants $\mathbf{c}$.
Furthermore, we assume w.l.o.g. that $\varphi$ is in DNF
and let $w$ be the width and $r$ the negation rank of $\varphi$.

For each $\text{DNF}^r_w$-type $\tau(\mathbf{x})$ such that $\varphi \land \tau(\mathbf{x})$ is satisfiable
we fix in advance, and independently of $I$, a finite model of
$\varphi$ with distinguished elements $(M^\tau, \mathbf{a}^\tau)$ realising it:
$M^\tau \models \varphi \land \tau(\mathbf{a}^\tau)$. Let $C$ be the maximum number of facts in any
of the $M^\tau$. Note that $C$ is independent of $I$.

To obtain the model $M$, for every $\mathbf{b} \in \text{atoms}(I)$ having
$\text{DNF}^r_w$-type $\tau(\mathbf{z})$ in $J$ we take $(M^{\mathbf{b}}, \mathbf{a}^{\mathbf{b}})$ to be a fresh copy of
$(M^\tau, \mathbf{a}^\tau)$ and attach it to $I$ by identifying its distinguished
tuple $\mathbf{a}^{\mathbf{b}}$ with $\mathbf{b}$ component-wise. Thus $M$ is made up of at
most $Cn$ many facts. It remains to verify that $M \models \varphi$.

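Claim 1 For every $\mathbf{b} \in \text{atoms}(I)$ and $\mathbf{d} \in \text{atoms}(M^{\mathbf{b}})$ we
have $M^{\mathbf{b}}, \mathbf{d} \equiv^r_w M, \mathbf{d}$.

From this claim it follows trivially that $M \equiv^r_w M^{\mathbf{b}}$, therefore
also $M \equiv^r_w J$, since $M^{\mathbf{b}} \equiv^r_w J$ by choice. Because
$J \models \varphi$, this will allow us to conclude $M \models \varphi$.

To establish Claim 1 we prove by induction on $q = 0, \dots, r$
that $M^{\mathbf{b}}, \mathbf{d} \equiv^q_w M, \mathbf{d}$ for every $\mathbf{b} \in \text{atoms}(I)$ and $\mathbf{d} \in
\text{atoms}(M^{\mathbf{b}})$. The latter claim is trivially true for $q = 0$.
Towards the induction step assume it is true for $q - 1$ and
consider an arbitrary $\mathbf{b} \in \text{atoms}(I)$ and $\mathbf{d} \in \text{atoms}(M^{\mathbf{b}})$. It
suffices to show that $M^{\mathbf{b}} \models \psi(\mathbf{d}) \iff M \models \psi(\mathbf{d})$ for all

$\psi(\mathbf{x}) = \exists \mathbf{y} \bigwedge_l (\alpha_l(\mathbf{z}^l) \land \neg\psi_l(\mathbf{z}^l))$

where each $\alpha_l(\mathbf{z}^l)$ is an atomic formula and $\psi_l \in \text{DNF}^{q-1}_w$
with free variables $\mathbf{z}^l$ from among $\mathbf{x}\mathbf{y}$. For $\psi$ as above we
additionally define $\nu(\mathbf{x}, \mathbf{y}) = \bigwedge_l (\alpha_l(\mathbf{z}^l) \land \neg\psi_l(\mathbf{z}^l))$.

Tackling first the easy direction, suppose that $M^{\mathbf{b}} \models \psi(\mathbf{d})$
and consider witnesses $\mathbf{c}$ in $M^{\mathbf{b}}$ such that $M^{\mathbf{b}} \models \nu(\mathbf{d}, \mathbf{c})$.
Then $M \models \alpha_l(\mathbf{e}^l)$ is immediate for each subtuple $\mathbf{e}^l$ that
relates to $\mathbf{d}\mathbf{c}$ as $\mathbf{z}^l$ relates to $\mathbf{x}\mathbf{y}$, while $M \models \neg\psi_l(\mathbf{e}^l)$ follows
from $M^{\mathbf{b}} \models \neg\psi_l(\mathbf{e}^l)$ via the induction hypothesis. This
proves $M \models \psi(\mathbf{d})$.

Suppose now $M \models \psi(\mathbf{d})$ and let $\mathbf{c}$ be elements of $M$ such
that $M \models \nu(\mathbf{d}, \mathbf{c})$. Our aim is to find witnesses $\mathbf{c}'$ in $M^{\mathbf{b}}$
such that $M^{\mathbf{b}} \models \nu(\mathbf{d}, \mathbf{c}')$. We distinguish two cases.
(i) If $\mathbf{c}$ lies entirely in $M^{\mathbf{b}}$ then, using the induction hypothesis
as in the proof of the opposite direction, we can confirm
that $\mathbf{c}' = \mathbf{c}$ are appropriate witnesses: $M^{\mathbf{b}} \models \nu(\mathbf{d}, \mathbf{c})$.
(ii) Otherwise we proceed as follows. For each atom $\alpha_l(\mathbf{z}^l)$
from $\nu$ let $\mathbf{e}^l$ be the subtuple relating to $\mathbf{d}\mathbf{c}$ as $\mathbf{z}^l$ relates to
$\mathbf{x}\mathbf{y}$. Thus $M \models \alpha_l(\mathbf{e}^l) \land \neg\psi_l(\mathbf{e}^l)$ for each $l$.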

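Next, for each $\mathbf{a} \in \text{atoms}(I)$ let $\Lambda(\mathbf{a})$ be the set of those
indices $l$ such that $\mathbf{e}^l$ lies entirely in $M^{\mathbf{a}}$ and let $\mathbf{e}^{(\mathbf{a})}$
enumerate (without repetition) all elements from those $\mathbf{e}^l$ with
$l \in \Lambda(\mathbf{a})$. Let in addition $\delta^{\mathbf{a}}$ be the conjunction of all those
formal equalities $a_j = e^{(\mathbf{a})}_k$ that hold in $M$. Further let
$\nu^{\mathbf{a}}(\mathbf{a}, \mathbf{e}^{(\mathbf{a})}) = \delta^{\mathbf{a}} \land \bigwedge_{l \in \Lambda(\mathbf{a})} \alpha_l(\mathbf{e}^l) \land \neg\psi_l(\mathbf{e}^l)$ and
$\psi^{\mathbf{a}} := \exists \mathbf{e}^{(\mathbf{a})}\, \nu^{\mathbf{a}}$. Thus, $M \models \nu^{\mathbf{a}}(\mathbf{a}, \mathbf{e}^{(\mathbf{a})})$ and we find,
as in (i), that also $M^{\mathbf{a}} \models \nu^{\mathbf{a}}(\mathbf{a}, \mathbf{e}^{(\mathbf{a})})$, hence
$M^{\mathbf{a}} \models \psi^{\mathbf{a}}(\mathbf{a})$ and, because $M^{\mathbf{a}}, \mathbf{a} \equiv^q_w J, \mathbf{a}$, we also learn
that $J \models \psi^{\mathbf{a}}(\mathbf{a})$. So there are $\mathbf{u}^{(\mathbf{a})}$ in $J$ such that
$J \models \nu^{\mathbf{a}}(\mathbf{a}, \mathbf{u}^{(\mathbf{a})})$.

Let $\mathbf{u}$ enumerate all $\mathbf{u}^{(\mathbf{a})}$ for $\mathbf{a} \in \text{atoms}(I)$ different from
$\mathbf{b}$. Note that, crucially, $|\mathbf{u}| \le w$. Indeed, because for each $\mathbf{a}$
elements of $\mathbf{a}$ and $\mathbf{u}^{(\mathbf{a})}$ satisfy the equalities prescribed in $\delta^{\mathbf{a}}$,
it is ensured that any equalities between elements of witnessing
tuples $\mathbf{e}^{(\mathbf{a})}$ in $M^{\mathbf{a}}$ and $\mathbf{e}^{(\mathbf{a}')}$ in $M^{\mathbf{a}'}$ with $\mathbf{a} \neq \mathbf{a}'$ (which, by
definition of $M$, must necessarily involve elements occurring
both in $\mathbf{a}$ and in $\mathbf{a}'$) are also observed by the corresponding
witnesses $\mathbf{u}^{(\mathbf{a})}$ and $\mathbf{u}^{(\mathbf{a}')}$ in $J$.

Altogether we have $J \models \bigwedge_{\mathbf{a} \neq \mathbf{b}} \nu^{\mathbf{a}}(\mathbf{a}, \mathbf{u}^{(\mathbf{a})})$, where $\mathbf{a}$ ranges
over $\text{atoms}(I) \setminus \{\mathbf{b}\}$. Let $\delta^{\mathbf{b}}$ be the conjunction of all
equalities $b_j = u_k$ that do hold in $J$, and let $\zeta(\mathbf{b}, \mathbf{u}) =
\delta^{\mathbf{b}} \land \bigwedge_{\mathbf{a} \neq \mathbf{b}} \bigwedge_{l \in \Lambda(\mathbf{a})} \alpha_l(\mathbf{e}^l) \land \neg\psi_l(\mathbf{e}^l)$, where $\mathbf{a}$ ranges over
$\text{atoms}(I) \setminus \{\mathbf{b}\}$. By the above, $J \models \exists \mathbf{u}\, \zeta(\mathbf{b}, \mathbf{u})$ and
$\exists \mathbf{u}\, \zeta(\mathbf{b}, \mathbf{u}) \in \text{DNF}^q_w$, so from $M^{\mathbf{b}}, \mathbf{b} \equiv^q_w J, \mathbf{b}$ it follows that
$M^{\mathbf{b}} \models \exists \mathbf{u}\, \zeta(\mathbf{b}, \mathbf{u})$. Taking into account that $\delta^{\mathbf{b}}$ stipulates all
equalities between components of $\mathbf{b}$ and any witnesses $\mathbf{u}$ to
$\zeta$ in $M^{\mathbf{b}}$ that are valid in $J$ and, correspondingly, that are
valid in $M$ between elements of $\mathbf{b}$ and the original witnesses
$\mathbf{c}$ to $\psi$ in $M$, we can conclude from this and from of course
$M^{\mathbf{b}} \models \psi^{\mathbf{b}}(\mathbf{d})$ that $M^{\mathbf{b}} \models \psi(\mathbf{d})$ as needed. This completes
the induction step in the proof of Claim 1. □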

A.7 Proof of Theorem |7J] 

In the proof below, we concentrate on boolean queries.
The PTime upper bound for boolean queries extends immediately
to non-boolean queries: consider a $k$-ary SGNQ
$\varphi(x_1, \dots, x_k)$. Let $P_1, \dots, P_k$ be fresh monadic predicates.
Then the problem of testing whether $I \models_{OWA} \varphi(a_1, \dots, a_k)$,
for given $I$ and $a_1, \dots, a_k$, reduces to the problem whether

$I' \models_{OWA} \exists x_1, \dots, x_k\, (\varphi(x_1, \dots, x_k) \land P_1(x_1) \land \dots \land P_k(x_k))$,

where $I'$ extends $I$ by interpreting each new predicate $P_i$ by
the singleton set $\{a_i\}$.

Proof. Consider a boolean SGNQ Q. By prescription, 
conjunctions under an even number of negations in Q may 
contain at most one conjunct that is a negated subformula. 
To allow for a uniform treatment we introduce a fresh nuUary 
predicate false and add -if alse as a new conjunct in those 
subformulas of Q (whether in the context of an even or odd 
number of negations), where there were no negative con- 
juncts. After this trivial transformation Q takes the form 



false V Y3x(Qi(x) A^3y Vi(x,y)) 



(4) 



where each Oi is a conjunction of atoms and ipi is a DNF fro- 
mula built with only 3, A and guarded negation. We allow 
above |x| = and at to be an empty conjunction, i.e. vac- 
uously true. Thus, (J4| contains, as a special case, disjuncts 
of the form ^3y ip{y). As another special case, B may 
contain disjuncts 3x [ctii'Si.) A ^f alse) with il>i — false and 
the corresponding quantification 3y being vacuous (jy| =0). 
We shall write Q equivalently as 



/\Vx(a,(x) -5> 3y Vi(x,y)) -^ false 



(5) 



akin to a formulation of CQ entailment of frontier-guarded
tgds - except for the fact that $\psi_i(\mathbf{x}, \mathbf{y})$ need not be quantifier
free. Via induction on the quantifier alternation rank we
show that (5) can be 'flattened' to an equi-satisfiable
$\forall\exists$-formula of GNFO asserting that a conjunction of
frontier-guarded tgds entails $\mathit{false}$.
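
For readers who want the passage from (4) to (5) spelled out, the following derivation is a sketch using only standard first-order equivalences; it writes $\alpha_i$ and $\psi_i$ for the conjunction of atoms and the DNF formula of the $i$-th disjunct, and it is not taken from the paper itself:

```latex
\begin{align*}
\mathit{false} \vee \bigvee_i \exists \mathbf{x}\,
  \bigl(\alpha_i(\mathbf{x}) \wedge \neg\exists \mathbf{y}\, \psi_i(\mathbf{x},\mathbf{y})\bigr)
&\equiv \mathit{false} \vee \bigvee_i \neg\forall \mathbf{x}\,
  \bigl(\alpha_i(\mathbf{x}) \to \exists \mathbf{y}\, \psi_i(\mathbf{x},\mathbf{y})\bigr) \\
&\equiv \neg\bigwedge_i \forall \mathbf{x}\,
  \bigl(\alpha_i(\mathbf{x}) \to \exists \mathbf{y}\, \psi_i(\mathbf{x},\mathbf{y})\bigr)
  \vee \mathit{false} \\
&\equiv \bigwedge_i \forall \mathbf{x}\,
  \bigl(\alpha_i(\mathbf{x}) \to \exists \mathbf{y}\, \psi_i(\mathbf{x},\mathbf{y})\bigr)
  \to \mathit{false}
\end{align*}
```

The first step uses $\exists \mathbf{x}\,(A \wedge \neg B) \equiv \neg\forall \mathbf{x}\,(A \to B)$, the second is De Morgan's law, and the third is the definition of implication.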

Each $\psi_i$ is in disjunctive normal form and occurs in the
scope of an odd number of negations in the disjunction-free
$\phi$. Hence it assumes the following general form

$$\psi_i(\mathbf{x}, \mathbf{y}) = \gamma_i \wedge \bigwedge_m \neg\delta_{i,m} \wedge \bigwedge_n \neg\exists \mathbf{z}\, (\gamma_{i,n} \wedge \neg\zeta_{i,n})$$

where $\gamma_i$ and $\gamma_{i,n}$ are conjunctions of positive atoms, each
$\delta_{i,m}$ is an atom and each $\zeta_{i,n}$ is either an atom, or an
existentially quantified formula, or $\mathit{false}$. Let $W_i(\mathbf{x}, \mathbf{y})$ be a
new predicate symbol of the same arity as $\psi_i$. We replace
in (5) the $i$-th conjunct with a collection of new conjuncts:

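$\forall \mathbf{x}\, (\alpha_i(\mathbf{x}) \to \exists \mathbf{y}\, W_i(\mathbf{x}, \mathbf{y}))$
$\forall \mathbf{x}\mathbf{y}\, (W_i(\mathbf{x}, \mathbf{y}) \to \gamma_i)$
$\forall \mathbf{x}\mathbf{y}\, (W_i(\mathbf{x}, \mathbf{y}) \wedge \delta_{i,m} \to \mathit{false})$ for each $\delta_{i,m}$
$\forall \mathbf{x}\mathbf{y}\mathbf{z}\, (W_i(\mathbf{x}, \mathbf{y}) \wedge \gamma_{i,n} \to \zeta_{i,n})$ for each $\gamma_{i,n}$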



Because all negations were properly guarded, all of these
rules are frontier-guarded tgds, save perhaps some of those
of the last kind with $\zeta_{i,n}$ an existentially quantified formula,
which are then subject to the same restrictions as the original
conjuncts of (5), only having a lower quantifier alternation
rank. Iterating this transformation one eventually arrives at
the desired form comprising only frontier-guarded tgds.
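
One step of this flattening can be sketched in code. The snippet below is only an illustration: formulas are plain strings, the predicate name `W_i` and the arrow syntax are assumed conventions, and no guardedness check is performed.

```python
def flatten_conjunct(i, alpha, gamma, deltas, guarded):
    """Emit the rules that replace the i-th conjunct of (5).

    alpha   : body atoms alpha_i(x), as a string
    gamma   : positive part gamma_i of psi_i, as a string
    deltas  : list of negated atoms delta_{i,m}
    guarded : list of pairs (gamma_{i,n}, zeta_{i,n})
    All arguments are hypothetical string encodings chosen for this sketch.
    """
    w = f"W_{i}(x,y)"
    rules = [
        f"{alpha} -> exists y {w}",   # alpha_i(x) -> exists y W_i(x,y)
        f"{w} -> {gamma}",            # W_i(x,y) -> gamma_i
    ]
    # one rule per negated atom: W_i(x,y) & delta_{i,m} -> false
    rules += [f"{w} & {d} -> false" for d in deltas]
    # one rule per guarded negation: W_i(x,y) & gamma_{i,n} -> zeta_{i,n}
    rules += [f"{w} & {g} -> {z}" for g, z in guarded]
    return rules
```

If some `zeta` is itself existentially quantified, the corresponding rule has the same shape as a conjunct of (5) and the step is applied again, which mirrors the induction on the quantifier alternation rank.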

The theorem follows from the fact that OWA query answering
against frontier-guarded tgds has PTime data complexity [5].
In fact, [4] shows that every CQ can be rewritten
relative to a set of frontier-guarded tgds into a Datalog program
that can be executed on a database instance to yield
the OWA answer to the original query. Thus, each serial
GNFO query $Q$ can also be reformulated as a Datalog program
$(\Pi, \mathit{false})$ such that $I \models_{OWA} Q \iff \Pi(I) \models \mathit{false}$
for all instances $I$. □

A.8 Proof of Theorem 7.8

Proof. We combine ideas of [22] and [11, Theorem 15]
to encode computations of a fixed Turing machine with the
database instances representing input words. Using guarded
tgds and an egd (in fact, a key constraint), we will force the
existence of a grid frame onto which a valid computation of
the Turing machine is charted. The advantage of using tgds
and egds is the well-known fundamental principle [26, 19, 31]
that open-world query answering on an instance $D$ relative
to a set $\Sigma$ of tgds and egds reduces to query evaluation on
the single universal (though generally infinite) chase model
$\mathit{chase}(D, \Sigma)$. While the chase model with respect to guarded
tgds is always tree-like [11, 5], the additional key constraint
imposed by the egd enforces a grid-like structure of the chase
model.

Let $M$ be a Turing machine, to be chosen later, having
tape alphabet $A$, states $Q$ and transition function
$\delta : Q \times A \to Q \times A \times \{-1, 0, 1\}$. Input words to $M$ will be
presented as successor-structures, comprising a succ-chain
of $A$-labelled elements. The signature consists of a binary
relation $\mathit{succ}$ and unary relations $P_a$ for every $a \in A$. We
assume that the predicates $P_a$ partition the input structure,
and that $A$ contains a special start symbol $\triangleright$ labelling only the
first element and a special blank symbol $\flat$ labelling only the
last element of the successor-chain.

Next we define a set $\Sigma_M$ of guarded tgds and egds over
an expanded signature responsible for simulating $M$. $\Sigma_M$
has as conjuncts the following guarded tgds (omitting the
implicit universal quantification of variables in rule bodies).

$\mathit{succ}(x, y) \to \exists z\, \mathit{succ}(y, z)$
$\mathit{succ}(x, y) \wedge P_\flat(x) \to P_\flat(y)$
$\mathit{succ}(x, y) \to \exists u v\, \mathit{cell}(x, y, u, v)$
$\mathit{cell}(x, y, u, v) \to \mathit{next}(x, u) \wedge \mathit{next}(y, v) \wedge \mathit{succ}(u, v)$

In addition, $\Sigma_M$ contains the key constraint

$$\mathit{next}(x, y) \wedge \mathit{next}(x, z) \to y = z \tag{6}$$

expressing functionality of $\mathit{next}$. It is easy to see that the
infinite chase of any input structure as specified above wrt.
these guarded tgds and the egd is an infinite grid with $\mathit{succ}$
and $\mathit{next}$ acting as horizontal and vertical successor edges
and whose bottom succ-chain is labelled with $\triangleright w \flat^\omega$, where
$w$ is the input word. Consequently, every model of these
rules embeds a homomorphic image of this grid.
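
To see how the key constraint glues the tree-like chase into a grid, the following sketch builds a finite fragment of one chase sequence. It rests on several simplifying assumptions that are not in the paper: elements and labelled nulls are integers, the bottom chain is extended to a fixed `width` up front, the successor tgd is not fired on upper rows, and the function name `chase_grid` is hypothetical.

```python
from itertools import count

def chase_grid(width, height):
    """Finite fragment of a chase sequence for the succ/cell/next rules."""
    fresh = count()                                # supply of elements and nulls
    # bottom row: input elements plus nulls from succ(x,y) -> exists z succ(y,z)
    bottom = [next(fresh) for _ in range(width)]
    succ = set(zip(bottom, bottom[1:]))
    nxt = set()
    frontier = set(succ)                           # succ facts not yet used by the cell tgd
    for _ in range(height):
        new_next, new_succ = set(), set()
        for x, y in frontier:
            u, v = next(fresh), next(fresh)        # succ(x,y) -> exists u,v cell(x,y,u,v)
            new_next |= {(x, u), (y, v)}           # cell -> next(x,u) and next(y,v)
            new_succ.add((u, v))                   # cell -> succ(u,v)
        # egd (6): next(x,t1) and next(x,t2) force t1 = t2; keep the smaller null
        rep, target = {}, {}
        for x, t in sorted(new_next):
            if x in target:
                rep[t] = target[x]
            else:
                target[x] = t
        nxt |= {(x, rep.get(t, t)) for x, t in new_next}
        frontier = {(rep.get(u, u), rep.get(v, v)) for u, v in new_succ}
        succ |= frontier
    return succ, nxt
```

Without the merging step each `cell` fact would hang off its `succ` fact as a separate branch; with it, the right witness of one cell coincides with the left witness of the neighbouring cell, so every new row is again a succ-chain.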

The next step is to implement, given the grid frame, the
workings of the Turing machine $M$ using additional guarded
tgd rules. To this end we will make use of additional
unary predicates $S_q$ for every state $q \in Q$ of $M$. Let $\mathit{init}$ be
the initial state and $\mathit{acc}$ the w.l.o.g. unique accepting state
of $M$. To initiate the computation $\Sigma_M$ specifies

$P_\triangleright(x) \to S_{\mathit{init}}(x)$

and to carry it on $\Sigma_M$ contains guarded tgds associated to
each transition $(p, a, q, b, t) \in \delta$.

$\mathit{cell}(x, y, u, v) \wedge S_p(x) \wedge P_a(x) \to P_b(x) \wedge S_q(u)$ for $t = 0$
$\mathit{cell}(x, y, u, v) \wedge S_p(x) \wedge P_a(x) \to P_b(x) \wedge S_q(v)$ for $t = 1$
$\mathit{cell}(x, y, u, v) \wedge S_p(y) \wedge P_a(y) \to P_b(y) \wedge S_q(u)$ for $t = -1$

To ensure that tape symbols not affected by a transition
are copied from one configuration to the next we add unary
predicates $L$ and $R$ (intuitively, $L$ and $R$ mark the positions
left and right of the head of a configuration, respectively)
and the following guarded tgd rules.

$\mathit{succ}(x, y) \wedge S_q(x) \to R(y)$
$\mathit{succ}(x, y) \wedge R(x) \to R(y)$
$\mathit{succ}(x, y) \wedge S_q(y) \to L(x)$
$\mathit{succ}(x, y) \wedge L(y) \to L(x)$
$\mathit{cell}(x, y, u, v) \wedge R(y) \wedge P_a(y) \to P_a(v)$
$\mathit{cell}(x, y, u, v) \wedge L(x) \wedge P_a(x) \to P_a(u)$



This completes the specification of the set $\Sigma_M$ of guarded
tgds and the single egd responsible for simulating the Turing
machine $M$. It should be clear that $M$ accepts a word $w$ if,
and only if, the corresponding instance $D_w$ satisfies

$$D_w, \Sigma_M \models_{OWA} \exists x\, S_{\mathit{acc}}(x).$$

The first claim of the theorem now follows by choice of some
Turing machine $M$ that accepts an r.e.-complete language.
For the second claim consider the key constraint (6) and the
GNFO query $\psi_M \vee \exists x\, S_{\mathit{acc}}(x)$, where $\psi_M$ is the disjunction
of the negations of the guarded tgds of $\Sigma_M$. □



A.9 Proof of Proposition 8.3



Proof. For the equivalence between (v) and (vi) we
merely note that the stage increments $X^{n+1} \setminus X^n$ for each
IDB predicate $X$ are GNFO-definable, for each $n \in \mathbb{N}$. Now
$\Pi$ is classically unbounded if, and only if, for at least one $X$,
these formulas are individually satisfiable, for every $n \in \mathbb{N}$;
and similarly in restriction to finite instances. The finite
model property for GNFO therefore shows the equivalence.
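
To make the notion of a stage increment concrete, here is a small sketch for one fixed monadic Datalog program, chosen purely as an illustration and not taken from the paper: `Reach(x) <- Source(x)` and `Reach(y) <- Reach(x), Edge(x, y)`. The function name `stages` and the set-based encoding of relations are assumptions of the sketch.

```python
def stages(edges, sources):
    """Return the non-empty stage increments X^{n+1} minus X^n of Reach,
    starting from X^0 = empty set, on the given finite instance."""
    reach, increments = set(), []
    while True:
        # immediate consequences of the two rules applied to the current stage
        new = set(sources) | {y for (x, y) in edges if x in reach}
        delta = new - reach
        if not delta:            # fixed point reached, no further increments
            return increments
        increments.append(delta)
        reach |= delta
```

On a path of length $m$ starting at a source there are $m + 1$ non-empty increments, so for this program every increment is satisfiable on some finite instance, which is exactly the situation described as unbounded above.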

(ii) $\Rightarrow$ (iv) and (i) $\Rightarrow$ (iii) follow from the characterizations
of GNFO as a fragment of FO in terms of preservation
under suitable notions of guarded negation bisimulation
($w$-bounded guarded negation bisimulation), as presented in [t]
for the classical version and in [33] for the finite model theory
version. These apply since all stages of $\Pi$ (finite and
infinite, if we admit infinite instances) and especially the
limit $\Pi^\infty$ are preserved under $w$-bounded guarded negation
bisimulation, if $\Pi$ is of width $w$.

(iv) $\Rightarrow$ (vi) is the natural variant of the classical
Barwise-Moschovakis theorem for GNFO, which may be obtained
from the classical one via the semantic characterisation of GNFO
as a fragment of FO in [t].

We concentrate on (iii) $\Rightarrow$ (iv). Assume that formulas
$\psi_X(\mathbf{x}) \in$ GNFO define $X^\infty$ across all finite instances,
but fail to define $X^\infty = \Pi^\infty(I)$ over some infinite
instance $I$. Appealing to the form of the rules in $\Pi$, we assume
w.l.o.g. that $\psi_X(\mathbf{x})$ is explicitly guarded in the form
$\psi_X(\mathbf{x}) = \bigvee_s \bigl(\alpha_s(\mathbf{x}_s) \wedge \psi_X^s(\rho_s(\mathbf{x}))\bigr)$, where every rule in $\Pi$
with head predicate $X$ gives rise to one disjunct, and $\rho_s$ is
the appropriate substitution to match the variable tuple $\mathbf{x}$
onto the $\mathbf{x}_s$ used in that rule (in particular, $\alpha_s$ guards all
free variables in $\psi_X^s(\rho_s(\mathbf{x}))$).

The fact that a tuple of predicates $\mathbf{P}$ is a fixed point of
$\Pi$ is expressible by a sentence $\chi \in$ GNFO (in the signature
extended with new $P_X$, one for each IDB predicate $X$ in
$\mathbf{X}$, which may even be used as guards). But for the tuple
of predicates defined by the $\psi_X$, there is even a sentence
$\xi \in$ GNFO in the basic (EDB) signature saying that this
tuple is a fixed point of $\Pi$: the crucial point to note is that
these predicate equalities reduce to set inclusions under each
one of the relevant guards $\alpha_s$ (!). If one of the $\psi_X$ failed
over any infinite instance, then, by the finite model property
for GNFO, it would also fail over some finite instance.
So the $\psi_X$ must define a fixed point of $\Pi$ across all, finite
and infinite, instances. A similar argument shows that this
fixed point defined by the $\psi_X$ over an infinite instance $I$
must be the least fixed point $X^\infty$. Otherwise, there would
have to be some other, strictly smaller fixed point $\mathbf{P}$ (viz.
$\mathbf{P} := X^\infty$). This fact can also be expressed by a sentence
of GNFO in the signature extended by the new predicate
letters $\mathbf{P}$. So the finite model property for GNFO would
again pull this situation down to some finite instance,
contradicting the assumption that the $\psi_X$ define $X^\infty$ over all
finite instances. □

A.10 Proof of Theorem 9.1

Proof. It is known that the implication problem for inclusion
dependencies and key constraints lacks finite controllability,
and is undecidable both on finite and on unrestricted
instances (cf. [1]). It follows that also the satisfiability
and query containment problems for GN-SQL are not
finitely controllable, and are undecidable both on finite
instances and on unrestricted instances. As for the last item,
it follows from Theorem 7.8 using the fact that key constraints
(being a special case of functional dependencies) can
be expressed in GN-SQL($\neq$). Specifically, the GN-SQL($\neq$)
query for which open world query answering is undecidable
is $q_1$ union $q_2$, where $q_1$ is the boolean GN-SQL query from
Theorem 7.8(ii) and $q_2$ is the boolean GN-SQL($\neq$) query
expressing the negation of the key constraint from
Theorem 7.8(ii). □



A.11 Proof of Theorem 9.2

Proof (sketch). In translating GN-SQL(lin) to
GNFO, we have to overcome a discrepancy in the use
of constants. The constants that may appear in a
GN-SQL(lin) query are actual values from the linearly ordered
domain lin. GNFO, on the other hand, allows for the use
of constant symbols, whose interpretation is given by the
structure, and may differ between structures. In particular,
a structure may interpret two constant symbols by the
same element. In order to overcome this discrepancy, we
(i) introduce for each element $d$ of lin a corresponding
constant symbol $d$, and (ii) construct a GNFO sentence
that "axiomatizes" the correct behavior of the constant
symbols (including the fact that distinct constant symbols
denote different values).

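More precisely, let lin $= (D, \prec)$ and for each finite subset
$S = \{d_1, \ldots, d_n\}$ of $D$, with $d_1 \prec \ldots \prec d_n$, let $\theta_S$ be the
following GNFO sentence, containing a constant symbol $d_i$
for each $d_i \in S$:

$$\forall x\, \bigl(\phi_{x<d_1} \vee \phi_{x=d_1} \vee \phi_{d_1<x<d_2} \vee \phi_{x=d_2} \vee \cdots \vee \phi_{x>d_n}\bigr)$$

where

$$\phi_{x<d_1} = \begin{cases} \bigwedge_{i \le n} \bigl(x < d_i \wedge \neg(x = d_i) \wedge \neg(d_i < x)\bigr) & \text{if } \exists d \in D\ \ d \prec d_1 \\ \bot & \text{otherwise} \end{cases}$$

$$\phi_{x=d_i} = (x = d_i) \wedge \neg(x < d_i) \wedge \neg(d_i < x) \wedge \bigwedge_{j<i} \bigl(d_j < x \wedge \neg(d_j = x) \wedge \neg(x < d_j)\bigr) \wedge \bigwedge_{j>i} \bigl(x < d_j \wedge \neg(d_j = x) \wedge \neg(d_j < x)\bigr)$$

$$\phi_{d_i<x<d_{i+1}} = \begin{cases} \bigwedge_{j \le i} \bigl(d_j < x \wedge \neg(d_j = x) \wedge \neg(x < d_j)\bigr) \wedge \bigwedge_{j>i} \bigl(x < d_j \wedge \neg(d_j = x) \wedge \neg(d_j < x)\bigr) & \text{if } \exists d \in D\ \ d_i \prec d \prec d_{i+1} \\ \bot & \text{otherwise} \end{cases}$$

$$\phi_{x>d_n} = \begin{cases} \bigwedge_{i \le n} \bigl(d_i < x \wedge \neg(x = d_i) \wedge \neg(x < d_i)\bigr) & \text{if } \exists d \in D\ \ d_n \prec d \\ \bot & \text{otherwise} \end{cases}$$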



Observe that $\theta_S$ can be constructed from $S$ in
polynomial time, since lin is reasonable. Furthermore, the
following crucial property holds: if $M$ is any structure
satisfying $\theta_S$, and if $M'$ is an isomorphic copy of $M$ in which
each constant symbol $d_i$ denotes the actual corresponding
value $d_i$ (for all $d_i \in S$), then $M$ and $M'$ are indistinguishable
with respect to GN-SQL(lin) queries whose constants
are included in $S$. It follows that a GN-SQL(lin) query $q_1$ is
contained in a GN-SQL(lin) query $q_2$ if and only if, for their
GNFO translations $q_1'$ and $q_2'$, we have that $q_1' \wedge \theta_S \models q_2'$,
where $S$ is the set of constants occurring in $q_1$ and $q_2$. □
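
The case split behind $\theta_S$ can be illustrated for one concrete order. The sketch below assumes that lin is the set of integers with the usual order, which is an assumption made here only for illustration, and it lists the disjuncts that are not replaced by $\bot$; the function name `case_split` is hypothetical.

```python
def case_split(constants):
    """List the disjuncts of theta_S that are not bottom, assuming lin is
    the integers with the usual order. `constants` must be non-empty."""
    ds = sorted(set(constants))
    cases = [f"x < {ds[0]}"]                  # the integers have no least element
    for d, d_next in zip(ds, ds[1:]):
        cases.append(f"x = {d}")
        if d_next - d > 1:                    # some integer lies strictly between
            cases.append(f"{d} < x < {d_next}")
    cases.append(f"x = {ds[-1]}")
    cases.append(f"x > {ds[-1]}")             # the integers have no greatest element
    return cases
```

For a dense order such as the rationals every "between" case would be kept, while for a bounded order the first or last case would be dropped. Answering these existence questions for the given constants is the only information about lin that the construction needs.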



A.12 Proof of Proposition 9.3

Proof. The upper bound follows from the ExpTime upper
bound for stratified Datalog [18] (obtained by the standard
technique of "grounding" a program by instantiating
its rules via substituting domain elements for the variables
in every possible way and solving the resulting exponentially
large propositional Horn program with stratified negation by
standard means).
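
The grounding step mentioned here can be sketched as follows. The encoding of atoms as `(predicate, arguments)` pairs and the function name `ground_rule` are assumptions of this sketch, which covers a single rule and ignores negation and stratification.

```python
from itertools import product

def ground_rule(head, body, variables, domain):
    """Yield every ground instance of one rule.

    An atom is a pair (predicate, args); each argument is either a variable
    name listed in `variables` or a constant (assumed encoding).
    """
    for values in product(domain, repeat=len(variables)):
        sub = dict(zip(variables, values))

        def inst(atom):
            pred, args = atom
            # replace variables by their assigned values, keep constants as-is
            return (pred, tuple(sub.get(t, t) for t in args))

        yield inst(head), [inst(a) for a in body]
```

A rule with $k$ variables over a domain of size $d$ has $d^k$ ground instances, which is the source of the exponential blow-up: for example, `T(x, z) <- E(x, y), T(y, z)` over a two-element domain yields $2^3 = 8$ propositional rules.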

For the lower bound we provide a reduction from the acceptance
problem for polynomial-space alternating Turing
machines. Consider an alternating Turing machine $M$ using
$p(n)$ tape cells on any input of length $n$. We may assume
w.l.o.g. that the states of $M$ are partitioned into existential
states $Q_\exists$ and universal states $Q_\forall$ and that the transition
table of $M$ consists of tuples $(p, a, q, b, \varepsilon, s, c, \delta)$ interpreted
as follows. When in state $p \in Q$ and reading $a$ there
are two possible transitions: writing $b$ at the current position,
entering state $q$, and moving the read-write head by
$\varepsilon \in \{-1, 0, +1\}$; or writing $c$, entering state $s$ and moving
the head by $\delta \in \{-1, 0, +1\}$. The choice between the two
possibilities is existential or universal according to whether
$p \in Q_\exists$ or $p \in Q_\forall$. In addition we may assume w.l.o.g. that
$M$ has a unique initial state $\mathit{init}$, and a unique accepting
configuration with state $\mathit{acc}$, head position 1 and the used
segment of its tape filled with 0's.

Let $\mathfrak{B}$ be the structure with domain $\{0, 1\}$ and the binary
relation $\mathit{Bits}$ that holds the pair $(0, 1)$ alone. Given
an input word $w$ of length $|w| = n$ and $M$ as above we
let $N = p(n)$ and we devise a GN-Datalog program with
IDB predicates $S_{q,i}(u_1, \ldots, u_N, z, o)$ of arity $N + 2$ for each
$q \in Q_\exists \cup Q_\forall$ and $1 \le i \le N$. Intuitively speaking, every
fact $S_{q,i}(u_1, \ldots, u_N, z, o)$ will encode a configuration of $M$
in state $q$, head location $i$ and tape contents $u_1 \ldots u_N$, and
$z$ and $o$ will invariably contain the values 0 and 1 in every
such fact ever derived. The GN-Datalog program $\Pi_{M,n}$
simulating $M$ on an $N = p(n)$-bounded tape comprises the
following rules. For every transition $(p, a_0, q, a_1, \varepsilon, s, a_2, \delta)$
and every $i$ such that $1 \le i, i + \varepsilon, i + \delta \le N$, if $p \in Q_\exists$ then
there are rules

$$S_{p,i}(u_1, \ldots, u_{i-1}, \sigma_0, u_{i+1}, \ldots, u_N, z, o) \leftarrow S_{q,i+\varepsilon}(u_1, \ldots, u_{i-1}, \sigma_1, u_{i+1}, \ldots, u_N, z, o).$$

$$S_{p,i}(u_1, \ldots, u_{i-1}, \sigma_0, u_{i+1}, \ldots, u_N, z, o) \leftarrow S_{s,i+\delta}(u_1, \ldots, u_{i-1}, \sigma_2, u_{i+1}, \ldots, u_N, z, o).$$

and if $p \in Q_\forall$ then there is a rule

$$S_{p,i}(u_1, \ldots, u_{i-1}, \sigma_0, u_{i+1}, \ldots, u_N, z, o) \leftarrow S_{q,i+\varepsilon}(u_1, \ldots, u_{i-1}, \sigma_1, u_{i+1}, \ldots, u_N, z, o),\ S_{s,i+\delta}(u_1, \ldots, u_{i-1}, \sigma_2, u_{i+1}, \ldots, u_N, z, o).$$

where in both cases $\sigma_j$ is $z$ if $a_j = 0$ and $o$ if $a_j = 1$, for each $j = 0, 1, 2$.

In addition there is an acceptance rule

$$S_{\mathit{acc},1}(z, \ldots, z, z, o) \leftarrow \mathit{Bits}(z, o)$$



corresponding to the unique accepting configuration, and
the answer rule

$$\mathit{Ans}_w \leftarrow S_{\mathit{init},1}(u_1, \ldots, u_N, z, o),\ \mathit{Bits}(z, o)$$

encoding the initial configuration for a given input word
$w \in \{0, 1\}^n$, where each $u_i$ is one of the variables $z$ or $o$
according to whether the $i$th bit of the initial tape contents
with input $w$ is zero or one. Then $w$ is accepted by $M$ if
the GN-Datalog query $(\Pi_{M,|w|}, \mathit{Ans}_w)$ evaluates to true on
$\mathfrak{B}$. □



